PBS server core dump on apparent job delete

Hi,

We are running version 23.06.06 of OpenPBS and the server core dumped on an apparent request to delete a job which was then rejected. Looks like the job was already deleted perhaps.

This was the error in the server log just before the core…

11/15/2023 09:13:30;0080;Server@pbssrv1;Job;3512920;delete job request received

11/15/2023 09:13:30;0008;Server@pbssrv1;Job;3512920.pbssrv1;Job to be deleted at request of pbsuser@headnode

11/15/2023 09:13:30;0008;Server@pbssrv1;Job;3512922.pbssrv1;Job to be deleted at request of pbsuser@headnode

11/15/2023 09:13:30;0008;Server@k-admin;Job;3512917.pbssrv1;Job to be deleted at request of pbsuser@headnode

11/15/2023 09:13:30;0008;Server@pbssrv1;Job;3512924.pbssrv1;Job to be deleted at request of pbsuser@headnode

11/15/2023 09:13:30;0080;Server@pbssrv1;Job;B<9D>#;Unknown Job Id

11/15/2023 09:13:30;0080;Server@pbssrv1;Svr;update_deljob_rply;job B<9D># has already been deleted from delete job list

11/15/2023 09:13:30;0080;Server@pbssrv1;Req;req_reject;Reject reply code=15001, aux=0, type=100, from pbssrv1@headnode

Here is a backtrace from the core dump (let me know if you would like the actual core file and how I can upload it to you guys):

warning: Section `.reg-xstate/7281’ in core file too small.

[Thread debugging using libthread_db enabled]

Using host libthread_db library “/lib64/libthread_db.so.1”.

Core was generated by `/usr/local/pkgs/openpbs/sbin/pbs_server.bin’.

Program terminated with signal SIGABRT, Aborted.

warning: Section `.reg-xstate/7281’ in core file too small.

#0 0x00007f9ec428cb8f in raise () from /lib64/libc.so.6

[Current thread is 1 (Thread 0x7f9ec7047840 (LWP 7281))]

Missing separate debuginfos, use: yum debuginfo-install cyrus-sasl-lib-2.1.27-6.el8_5.x86_64 expat-2.2.5-11.0.1.el8.x86_64 glibc-2.28-225.0.4.el8_8.6.x86_64 gssproxy-0.8.0-21.el8.x86_64 keyutils-libs-1.5.10-9.el8.x86_64 krb5-libs-1.18.2-25.0.1.el8_8.x86_64 libblkid-2.32.1-42.el8_8.x86_64 libcom_err-1.45.6-5.el8.x86_64 libgcc-8.5.0-18.0.6.el8.x86_64 libical-3.0.3-3.el8.x86_64 libicu-60.3-2.el8_1.x86_64 libmount-2.32.1-42.el8_8.x86_64 libnsl2-1.2.0-2.20180605git4a062cf.el8.x86_64 libpq-13.5-1.el8.x86_64 libselinux-2.9-8.el8.x86_64 libstdc+±8.5.0-18.0.6.el8.x86_64 libtirpc-1.1.4-8.el8.x86_64 libxcrypt-4.1.1-6.el8.x86_64 nss_nis-3.0-8.el8.x86_64 openldap-2.4.46-18.el8.x86_64 openssl-libs-1.1.1k-9.el8_7.x86_64 pcre2-10.32-3.el8_6.x86_64 python3-libs-3.6.8-51.0.1.el8_8.2.x86_64 systemd-libs-239-74.0.6.el8_8.5.x86_64 zlib-1.2.11-21.el8_7.x86_64

(gdb) bt

#0 0x00007f9ec428cb8f in raise () from /lib64/libc.so.6

#1 0x00007f9ec425fea5 in abort () from /lib64/libc.so.6

#2 0x00007f9ec42cdda7 in __libc_message () from /lib64/libc.so.6

#3 0x00007f9ec42d509c in malloc_printerr () from /lib64/libc.so.6

#4 0x00007f9ec42d6c24 in _int_free () from /lib64/libc.so.6

#5 0x00000000004b063d in free_string_array (arr=0x239f66e0) at misc_utils.c:1346

#6 0x00000000004b1bba in free_string_array (arr=) at misc_utils.c:1344

#7 0x0000000000455665 in free_br (preq=0x239f6d50) at process_request.c:1658

#8 0x0000000000457abb in reply_send (request=request@entry=0x239f6d50) at reply_send.c:369

#9 0x0000000000457ec3 in req_reject (code=15001, aux=aux@entry=0, preq=preq@entry=0x239f6d50) at reply_send.c:536

#10 0x0000000000459bba in req_deletejob (preq=0x239f6d50) at req_delete.c:625

#11 0x0000000000455844 in process_request (sfds=20) at process_request.c:720

#12 0x00000000004c1dee in process_socket (sock=sock@entry=20) at net_server.c:510

#13 0x00000000004c1fca in wait_request (waittime=, priority_context=) at net_server.c:623

#14 0x000000000042749e in main (argc=, argv=0x7ffcc65460e8) at pbsd_main.c:1398

Looks like a classic use-after-free (the bogus job id is a give-away). There are some clues. What is server k-admin which logged job 3512917 vs. pbssrv1 which logged the others? What is interesting about the four jobs mentioned? Why were they deleted? Were any of them running or were they all still queued? You show a “delete job request received” message for only job 3512920. Were there similar messages for the other jobs? What was the status of the four jobs when the server was restarted?

Have you tried to reproduce the issue? Has it happened more than once?

k-admin and pbssrv1 are the same system, you can ignore that.
I am not sure why the jobs were deleted by the user.
All 4 jobs were deleted by the same user and none of them had ever started running.

Here are the delete requests in the log, but they were not successful as they got requeued after pbs was restart:

11/15/2023 09:13:30;0080;Server@pbssrv1;Job;3512920;delete job request received
11/15/2023 09:13:30;0008;Server@pbssrv1;Job;3512920.pbssrv1;Job to be deleted at request of pbsuser@headnode
11/15/2023 09:13:30;0008;Server@pbssrv1;Job;3512922.pbssrv1;Job to be deleted at request of pbsuser@headnode
11/15/2023 09:13:30;0008;Server@pbssrv1;Job;3512917.pbssrv1;Job to be deleted at request of pbsuser@headnode
11/15/2023 09:13:30;0008;Server@pbssrv1;Job;3512924.pbssrv1;Job to be deleted at request of pbsuser@headnode
11/15/2023 09:25:25;0086;Server@pbssrv1;Job;3512917.pbssrv1;Requeueing job, substate: 91 Requeued in queue:
11/15/2023 09:25:25;0086;Server@pbssrv1;Job;3512920.pbssrv1;Requeueing job, substate: 91 Requeued in queue:
11/15/2023 09:25:25;0086;Server@pbssrv1;Job;3512922.pbssrv1;Requeueing job, substate: 91 Requeued in queue:
11/15/2023 09:25:25;0086;Server@pbssrv1;Job;3512924.pbssrv1;Requeueing job, substate: 91 Requeued in queue:

The issue has only happened once. Lots of jobs have been queued, run and/or deleting since we upgraded pbs to 23.06.06 2.5 weeks ago, over 20,700 jobs to be exact. I have made an attempt to reproduce the issue by queuing several jobs and then deleting them but have not been successful in causing the server to crash. Seem like it probably has to be just the right moment and under particular conditions.

Do you have commit b1249644 installed? Here is the description:

commit b1249644259f24c2d01018676fa2b2f4c842b856
Author: Eric Pershey <ericpershey@users.noreply.github.com>
Date:   Wed Aug 23 10:17:36 2023 -0500

    qdel 1234 1234 -> server core dump (#2552)
    
    * adding 3 tests to expose the qdel of the same job segmentation fault.
    
...
    
    * fixes to eliminate seg faults, double frees and usage after free when
    deleting duplicate subjobs and subjob ranges
    
    ---------
    
    Co-authored-by: Brian R. Toonen <toonen@anl.gov>

Thanks Dale! That does look like the issue. I will give that patch a try.

I have run into a similar issue again (with commit b1249644 installed), but so far I am not able to reproduce it by deleting the same job multiple times. Maybe it is a specific race condition? Unfortunately the core file was not complete.

Messages file:
Apr 1 21:20:26 pbs-admin systemd[1]: Started Process Core Dump (PID 1235909/UID 0).
Apr 1 21:23:36 pbs-admin systemd-coredump[1235910]: Core file was truncated to 2147483648 bytes.
Apr 1 21:23:36 pbs-admin abrt-dump-journal-core[3304]: Failed to obtain all required information from journald
Apr 1 21:23:46 pbs-admin systemd-coredump[1235910]: Process 3338446 (pbs_server.bin) of user 0 dumped core.#012#012Stack trace of thread 3338446:#012#0 0x00007fbdd35beb8f n/a (n/a)
Apr 1 21:23:46 pbs-admin systemd[1]: systemd-coredump@4-1235909-0.service: Succeeded.
Apr 1 21:23:48 pbs-admin abrtd[1959]: Size of ‘/var/spool/abrt’ >= 5000 MB (MaxCrashReportsSize), deleting old directory ‘ccpp-2023-11-15-09:13:48.290787-7281’
Apr 1 21:23:48 pbs-admin abrtd[1959]: Size of ‘/var/spool/abrt’ >= 5000 MB (MaxCrashReportsSize), deleting old directory ‘ccpp-2023-11-01-03:54:28.458711-1601291’
Apr 1 21:23:48 pbs-admin abrtd[1959]: Size of ‘/var/spool/abrt’ >= 5000 MB (MaxCrashReportsSize), deleting old directory ‘ccpp-2023-11-01-04:40:56.708933-1622604’
Apr 1 21:23:48 pbs-admin abrtd[1959]: Size of ‘/var/spool/abrt’ >= 5000 MB (MaxCrashReportsSize), deleting old directory ‘Python3-2023-11-08-07:55:59-183456’
Apr 1 21:23:48 pbs-admin abrtd[1959]: Size of ‘/var/spool/abrt’ >= 5000 MB (MaxCrashReportsSize), deleting old directory ‘Python3-2023-12-08-11:16:35-3615085’
Apr 1 21:23:48 pbs-admin abrtd[1959]: Size of ‘/var/spool/abrt’ >= 5000 MB (MaxCrashReportsSize), deleting old directory ‘Python3-2023-12-08-11:21:02-3617674’
Apr 1 21:23:48 pbs-admin abrtd[1959]: Size of ‘/var/spool/abrt’ >= 5000 MB (MaxCrashReportsSize), deleting old directory ‘ccpp-2023-11-01-03:41:47.208627-1588051’
Apr 1 21:23:48 pbs-admin abrtd[1959]: Size of ‘/var/spool/abrt’ >= 5000 MB (MaxCrashReportsSize), deleting old directory ‘ccpp-2023-03-27-20:37:23.642881-4699’
Apr 1 21:23:48 pbs-admin abrtd[1959]: Size of ‘/var/spool/abrt’ >= 5000 MB (MaxCrashReportsSize), deleting old directory ‘Python3-2023-06-16-07:47:28-4115868’
Apr 1 21:23:48 pbs-admin abrtd[1959]: Size of ‘/var/spool/abrt’ >= 5000 MB (MaxCrashReportsSize), deleting old directory ‘Python3-2023-11-09-02:40:15-951460’
Apr 1 21:23:48 pbs-admin abrtd[1959]: Size of ‘/var/spool/abrt’ >= 5000 MB (MaxCrashReportsSize), deleting old directory ‘oops-2023-10-12-15:42:02-3197-0’
Apr 1 21:23:48 pbs-admin abrt-server[1236180]: Error: No segments found in coredump ‘./coredump’
Apr 1 21:23:48 pbs-admin abrt-server[1236180]: Can’t open file ‘core_backtrace’ for reading: No such file or directory
Apr 1 21:23:52 pbs-admin dbus-daemon[1952]: [system] Activating service name=‘org.freedesktop.problems’ requested by ‘:1.59089’ (uid=0 pid=1236228 comm=“/usr/libexec/platform-python /usr/bin/abrt-action-”) (using servicehelper)

PBS log (this is a similar problem to the issue this past fall):
04/01/2024 21:20:24;0040;Server@pbs-admin;Svr;pbssrv1;Scheduler sent command 3
04/01/2024 21:20:25;0008;Server@pbs-admin;Job;3944077.pbssrv1;Job Modified at request of Scheduler@pbs-admin
04/01/2024 21:20:26;0080;Server@pbs-admin;Job;3944025.pbssrv1;delete job request received
04/01/2024 21:20:26;0008;Server@pbs-admin;Job;3944025.pbssrv1;Job to be deleted at request of user4@login4
04/01/2024 21:20:26;0080;Server@pbs-admin;Job;3944025.pbssrv1;Also deleted job history
04/01/2024 21:20:26;0080;Server@pbs-admin;Job;,V;Unknown Job Id
04/01/2024 21:20:26;0080;Server@pbs-admin;Svr;update_deljob_rply;job ,V has already been deleted from delete job list
04/01/2024 21:20:26;0080;Server@pbs-admin;Req;req_reject;Reject reply code=15001, aux=0, type=100, from user4@login4

Core file unfortunately incomplete:
[root@pbs-admin coredump]# gdb -c ‘core.pbs_server\x2ebin.0.b0faa95e3fcd4fd2a8cf1a4387a7c982.3338446.1712020826000000’ /usr/local/pkgs/openpbs/sbin/pbs_server.bin
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-19.0.1.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later https://gcc02.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgnu.org%2Flicenses%2Fgpl.html&data=05|02|jim.matthews%40nasa.gov|8a433b638f354eba86b208dc539501a9|7005d45845be48ae8140d43da96dd17b|0|0|638477147250978603|Unknown|TWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D|40000|||&sdata=anbwp9WrPoV1%2B%2BRg7T2Uf2qapld4R9qEkAEaIwS13f0%3D&reserved=0
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type “show copying” and “show warranty” for details.
This GDB was configured as “x86_64-redhat-linux-gnu”.
Type “show configuration” for configuration details.
For bug reporting instructions, please see:
https://gcc02.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.gnu.org%2Fsoftware%2Fgdb%2Fbugs%2F&data=05|02|jim.matthews%40nasa.gov|8a433b638f354eba86b208dc539501a9|7005d45845be48ae8140d43da96dd17b|0|0|638477147250987034|Unknown|TWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D|40000|||&sdata=kL8gCnbVLnmMPc4TrYZMoLjh1iktAphAuDOBMBElmfw%3D&reserved=0.
Find the GDB manual and other documentation resources online at:
https://gcc02.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.gnu.org%2Fsoftware%2Fgdb%2Fdocumentation%2F&data=05|02|jim.matthews%40nasa.gov|8a433b638f354eba86b208dc539501a9|7005d45845be48ae8140d43da96dd17b|0|0|638477147250993100|Unknown|TWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D|40000|||&sdata=C54fslPdmGOkcCTdP7LWTsG%2BVt9UoT3uOnb%2Fwl9XWR8%3D&reserved=0.

For help, type “help”.
Type “apropos word” to search for commands related to “word”…
Reading symbols from /usr/local/pkgs/openpbs/sbin/pbs_server.bin…done.
BFD: warning: /var/lib/systemd/coredump/core.pbs_server\x2ebin.0.b0faa95e3fcd4fd2a8cf1a4387a7c982.3338446.1712020826000000 is truncated: expected core file size >= 2446700544, found: 2147483648
[New LWP 3338446]
[New LWP 3338447]

warning: Section `.reg-xstate/3338446’ in core file too small.

warning: Error reading shared library list entry at 0x735f636f6c657200
Failed to read a valid object file image from memory.
Core was generated by `/usr/local/pkgs/openpbs/sbin/pbs_server.bin’.
Program terminated with signal SIGABRT, Aborted.

warning: Section `.reg-xstate/3338446’ in core file too small.
#0 0x00007fbdd35beb8f in ?? ()
[Current thread is 1 (LWP 3338446)]
(gdb) bt
#0 0x00007fbdd35beb8f in ?? ()
Backtrace stopped: Cannot access memory at address 0x7ffc6fa62740
(gdb)

Thanks,

–Jim