PBS crash on primary server

Hi all,

yesterday at about 20:30 my primary PBS server VM dumped the following in /var/log/messages:

> Jun 12 20:25:33 pbs01 systemd[1]: Started Process Core Dump (PID 211937/UID 0).
> Jun 12 20:25:33 pbs01 systemd-coredump[211938]: Process 7429 (pbs_server.bin) of user 0 dumped core.
>
> Stack trace of thread 7429:
> #0  0x00007f2aecb31acf raise (libc.so.6)
> #1  0x00007f2aecb04ea5 abort (libc.so.6)
> #2  0x00007f2aecb72cd7 __libc_message (libc.so.6)
> #3  0x00007f2aecb79fdc malloc_printerr (libc.so.6)
> #4  0x00007f2aecb7a3fc munmap_chunk (libc.so.6)
> #5  0x00000000004b285d free_string_array.part.2 (pbs_server.bin)
> #6  0x0000000000457a65 free_br (pbs_server.bin)
> #7  0x0000000000459e4b reply_send (pbs_server.bin)
> #8  0x000000000045a253 req_reject (pbs_server.bin)
> #9  0x000000000045bf4a req_deletejob (pbs_server.bin)
> #10 0x0000000000457c44 process_request (pbs_server.bin)
> #11 0x00000000004c3b4e process_socket (pbs_server.bin)
> #12 0x00000000004c3d2a wait_request (pbs_server.bin)
> #13 0x0000000000429eed main (pbs_server.bin)
> #14 0x00007f2aecb1dd85 __libc_start_main (libc.so.6)
> #15 0x000000000042ac7e _start (pbs_server.bin)
>
> Stack trace of thread 7431:
> #0  0x00007f2aecc11d3e epoll_pwait (libc.so.6)
> #1  0x00000000004a2cda work (pbs_server.bin)
> #2  0x00007f2aedfe11ca start_thread (libpthread.so.0)
> #3  0x00007f2aecb1ce73 __clone (libc.so.6)
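
In the raw /var/log/messages file, rsyslog writes the newlines embedded in the systemd-coredump message as #012, so the whole trace arrives as one long line. A minimal Python sketch for turning such a line back into readable frames is below; the function names are my own and the sample string is a shortened copy of the trace above. Read top-down, the frames show glibc aborting in munmap_chunk while pbs_server.bin was freeing a string array on the req_reject path of req_deletejob.

```python
import re

# One stack-frame line, e.g. "#5  0x00000000004b285d free_br (pbs_server.bin)"
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-f]+\s+(\S+)\s+\((.+)\)$")


def decode_syslog_newlines(message: str) -> str:
    """Turn the '#012' escapes written by rsyslog back into real newlines."""
    return message.replace("#012", "\n")


def parse_frames(decoded: str):
    """Return (frame_number, function, module) for every stack-frame line."""
    frames = []
    for line in decoded.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((int(match.group(1)), match.group(2), match.group(3)))
    return frames


if __name__ == "__main__":
    # Shortened sample taken from the log above (only frames 4 to 6 kept).
    raw = (
        "Process 7429 (pbs_server.bin) of user 0 dumped core.#012#012"
        "Stack trace of thread 7429:"
        "#012#4  0x00007f2aecb7a3fc munmap_chunk (libc.so.6)"
        "#012#5  0x00000000004b285d free_string_array.part.2 (pbs_server.bin)"
        "#012#6  0x0000000000457a65 free_br (pbs_server.bin)"
    )
    for number, function, module in parse_frames(decode_syslog_newlines(raw)):
        print(f"#{number:<2} {function} [{module}]")
```

Frames from all threads end up in one list here; splitting on the "Stack trace of thread" lines first would keep them apart.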

after that event the PBS server stopped logging in pbs_spool; the secondary server came up but for some reason the takeover was not complete: I was not able to change settings in the server with qmgr (err 15007) until I ran the following on the secondary:

pbs_server -F -1

/var/log/messages on the secondary shows:

Jun 12 20:27:15 pbs02 systemd[154171]: Startup finished in 32ms.
Jun 12 20:27:15 pbs02 systemd[1]: Started Session c13 of user postgres.
Jun 12 20:27:15 pbs02 systemd[1]: session-c13.scope: Succeeded.
Jun 12 20:27:15 pbs02 su[154231]: (to postgres) root on none
Jun 12 20:27:15 pbs02 systemd[1]: Started Session c14 of user postgres.
Jun 12 20:27:16 pbs02 systemd[1]: session-c14.scope: Succeeded.
Jun 12 20:27:17 pbs02 su[154301]: (to postgres) root on none
Jun 12 20:27:17 pbs02 systemd[1]: Started Session c15 of user postgres.
Jun 12 20:27:17 pbs02 systemd[1]: session-c15.scope: Succeeded.
Jun 12 20:27:27 pbs02 systemd[1]: Stopping User Manager for UID 26...

the OS is Rocky Linux 8.8 and PBS version is 22.05.1. My questions are:

  • what type of error is this?
  • if I want to make the primary active again, is it OK to run systemctl restart pbs on the primary? (I know I will lose the jobs)
  • could it be related to a lack of some resource (CPU, RAM, hanging shares)?

best regards

Hi all,

some additional information:

  • nfs-lock is not running on either PBS server; could this be the cause? They are both NFS clients of a third machine
  • same for nfs-idmap
  • spool is mounted as:
nfs rw,sync,soft 0 0
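
To see which options are actually in effect (as opposed to what fstab requests), the NFS entries can be read from /proc/mounts. The sketch below is only an illustration: the function names are mine, and the server name and mount point in the sample are hypothetical placeholders, since the full fstab line is not shown above. It flags mounts using the soft option, which per nfs(5) lets an NFS request fail with an error after the retries are exhausted instead of retrying indefinitely.

```python
def nfs_mounts(mounts_text: str):
    """Return (mount_point, fs_type, options) for NFS entries in /proc/mounts-style text."""
    entries = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0].startswith("#"):
            continue  # skip blank, short, or commented lines
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type in ("nfs", "nfs4"):
            entries.append((mount_point, fs_type, options.split(",")))
    return entries


def soft_mounts(mounts_text: str):
    """Return the mount points of NFS entries that carry the 'soft' option."""
    return [mp for mp, _fs, opts in nfs_mounts(mounts_text) if "soft" in opts]


if __name__ == "__main__":
    # Hypothetical sample; on a real host use open("/proc/mounts").read() instead.
    sample = (
        "/dev/sda1 / xfs rw,relatime 0 0\n"
        "nfsserver:/export/pbs /var/spool/pbs nfs4 rw,sync,soft,vers=4.2 0 0\n"
    )
    for mount_point in soft_mounts(sample):
        print(f"{mount_point} is mounted soft")
```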

Please follow this discussion; it might help.

Hi Adarsh, thank you.

That is pretty much what I have done. I am confident that it is not a network or naming issue: I had no problems resolving hostnames under any circumstances and, even if only partially, the failover process was carried out.

I am more uncertain about locks. I used to work with NFS v3 and did more or less what is described in the linked discussion (start the related services); however, I now have NFS v4. As far as I understand, rpc-statd, nfs-idmapd and nfs-lock are no longer needed as separate services, but I am unsure how to check whether locks are actually present. Will keep you posted.
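
One way to look for locks from the client side is /proc/locks, which lists the file locks the local kernel knows about, identified by device and inode. The sketch below parses that format and filters by the inode of a given file. It is a sketch under assumptions: the function names are mine, the sample lines are made up, and I am assuming that locks taken by local processes on NFS v4 files show up there as they do for local files.

```python
def parse_proc_locks(text: str):
    """Parse /proc/locks-style text into a list of dicts, one per lock line."""
    locks = []
    for line in text.splitlines():
        fields = line.split()
        # Waiters blocked on a lock carry an extra "->" marker after the id.
        blocked = len(fields) > 1 and fields[1] == "->"
        if blocked:
            del fields[1]
        if len(fields) != 8:
            continue  # not a lock line in the expected format
        _id, kind, mode, access, pid, dev_inode, start, end = fields
        locks.append({
            "kind": kind,          # POSIX, FLOCK or OFDLCK
            "mode": mode,          # ADVISORY or MANDATORY
            "access": access,      # READ or WRITE
            "pid": int(pid),
            "inode": int(dev_inode.rsplit(":", 1)[1]),
            "range": (start, end),
            "blocked": blocked,
        })
    return locks


def locks_on_inode(locks, inode: int):
    """Return only the locks that refer to the given inode number."""
    return [lock for lock in locks if lock["inode"] == inode]


if __name__ == "__main__":
    # Made-up sample; on a real host read /proc/locks and compare against
    # os.stat(path).st_ino of the file of interest.
    sample = (
        "1: POSIX  ADVISORY  WRITE 7429 00:2d:131 0 EOF\n"
        "1: -> POSIX  ADVISORY  WRITE 9001 00:2d:131 0 EOF\n"
        "2: FLOCK  ADVISORY  WRITE 812 fd:00:2531448 0 EOF\n"
    )
    for lock in locks_on_inode(parse_proc_locks(sample), 131):
        state = "waiting" if lock["blocked"] else "held"
        print(f"pid {lock['pid']}: {lock['kind']} {lock['access']} lock {state}")
```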

Anyway, this may explain the incomplete failover, but I do not think it is related to the cause of the crash. I thought it could be a resource problem. From the number of users and the queue limits I estimated 50 kB for the maximum number of jobs (the AG says 20), so it should not be a memory issue. I also did not notice any significant load or disk issues on the primary server. Do you have any other suggestions?

best,