Hi all,
I have a PBS primary/secondary setup running on Rocky Linux 8.8. Recently the pbs_server.bin
process on the primary has begun to die after a few hours, for no reason that I can understand. If I look in /var/log/messages,
I see the following:
1382106 Jul 9 16:22:57 pbs01 systemd-coredump[81551]: Core file was truncated to 2147483648 bytes.
1382107 Jul 9 16:23:17 pbs01 systemd-coredump[81551]: Process 2185 (pbs_server.bin) of user 0 dumped core.#012#012Stack trace of thread 2185:#012#0 0x00007fd0c9ea5acf n/a (n/a)
1382108 Jul 9 16:23:17 pbs01 systemd[1]: systemd-coredump@0-81550-0.service: Succeeded
The crash is always reported at the same point ("#012#012Stack trace") with the same pointer.
If I run coredumpctl, I get:
$> coredumpctl debug
...
#0 main (argc=1, argv=0x7fffffffd898) at ../../../src/server/pbsd_main.c:632
632 svr_interp_data.data_initialized = 0;
(gdb) where
#0 main (argc=1, argv=0x7fffffffd898) at ../../../src/server/pbsd_main.c:632
The first breakpoint takes me here, but I do not know how to proceed from this point.
In addition, I am seeing another strange behaviour: some nodes report the wrong number of resources, i.e. twice as many CPUs and GPUs as they actually have. Restarting the MoM fixes this. I do not know whether the two problems are related.
All of these problems started a few days ago.
Any help?
Thank you for your time