Hi all,
Yesterday at about 20:30 my primary PBS server VM logged the following in /var/log/messages:
>
> Jun 12 20:25:33 pbs01 systemd[1]: Started Process Core Dump (PID 211937/UID 0).
> Jun 12 20:25:33 pbs01 systemd-coredump[211938]: Process 7429 (pbs_server.bin) of user 0 dumped core.#012#012Stack trace of thread 7429:#012#0 0x00007f2aecb31acf raise (libc.so.6)#012#1 0x00007f2aecb04ea5 abort (libc.so.6)#012#2 0x00007f2aecb72cd7 __libc_message (libc.so.6)#012#3 0x00007f2aecb79fdc malloc_printerr (libc.so.6)#012#4 0x00007f2aecb7a3fc munmap_chunk (libc.so.6)#012#5 0x00000000004b285d free_string_array.part.2 (pbs_server.bin)#012#6 0x0000000000457a65 free_br (pbs_server.bin)#012#7 0x0000000000459e4b reply_send (pbs_server.bin)#012#8 0x000000000045a253 req_reject (pbs_server.bin)#012#9 0x000000000045bf4a req_deletejob (pbs_server.bin)#012#10 0x0000000000457c44 process_request (pbs_server.bin)#012#11 0x00000000004c3b4e process_socket (pbs_server.bin)#012#12 0x00000000004c3d2a wait_request (pbs_server.bin)#012#13 0x0000000000429eed main (pbs_server.bin)#012#14 0x00007f2aecb1dd85 __libc_start_main (libc.so.6)#012#15 0x000000000042ac7e _start (pbs_server.bin)#012#012Stack trace of thread 7431:#012#0 0x00007f2aecc11d3e epoll_pwait (libc.so.6)#012#1 0x00000000004a2cda work (pbs_server.bin)#012#2 0x00007f2aedfe11ca start_thread (libpthread.so.0)#012#3 0x00007f2aecb1ce73 __clone (libc.so.6)
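
In case it helps with reading the trace: the `#012` sequences in the log above are rsyslog's octal escape for a newline. Below is a minimal sketch of how the line could be expanded into one stack frame per line. The function name `expand_syslog_escapes` is my own, and it assumes only control characters (octal 000 to 037) were escaped, so that frame markers such as `#0` or `#10` are left alone.

```python
import re

# Matches '#' followed by a three-digit octal code in the range 000-037
# (control characters). Frame markers like "#0 " or "#10 " do not match,
# because they are not three octal digits with a leading zero.
_ESCAPE = re.compile(r"#(0[0-3][0-7])")


def expand_syslog_escapes(line: str) -> str:
    """Replace rsyslog-style #ooo escapes with the characters they stand for."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 8)), line)


# Shortened excerpt of the coredump message, used only as a demonstration.
sample = (
    "dumped core.#012#012Stack trace of thread 7429:"
    "#012#0 0x00007f2aecb31acf raise (libc.so.6)"
    "#012#10 0x0000000000457c44 process_request (pbs_server.bin)"
)
print(expand_syslog_escapes(sample))
```

Running the same function over the full line from /var/log/messages should give the two thread traces as separate, readable blocks.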
After that event the primary PBS server stopped logging in pbs_spool. The secondary server came up, but for some reason the takeover was not complete: I was not able to change server settings with qmgr (error 15007) until I ran the following on the secondary:
pbs_server -F -1
/var/log/messages on the secondary shows:
Jun 12 20:27:15 pbs02 systemd[154171]: Startup finished in 32ms.
Jun 12 20:27:15 pbs02 systemd[1]: Started Session c13 of user postgres.
Jun 12 20:27:15 pbs02 systemd[1]: session-c13.scope: Succeeded.
Jun 12 20:27:15 pbs02 su[154231]: (to postgres) root on none
Jun 12 20:27:15 pbs02 systemd[1]: Started Session c14 of user postgres.
Jun 12 20:27:16 pbs02 systemd[1]: session-c14.scope: Succeeded.
Jun 12 20:27:17 pbs02 su[154301]: (to postgres) root on none
Jun 12 20:27:17 pbs02 systemd[1]: Started Session c15 of user postgres.
Jun 12 20:27:17 pbs02 systemd[1]: session-c15.scope: Succeeded.
Jun 12 20:27:27 pbs02 systemd[1]: Stopping User Manager for UID 26...
The OS is Rocky Linux 8.8 and the PBS version is 22.05.1. My questions are:
- What type of error is this?
- If I want to make the primary the active server again, is it OK to run systemctl restart pbs on the primary? (I know I will lose the jobs.)
- Could it be related to a resource shortage (CPU, RAM, hanging shares)?
Best regards