pbs_server process dies and resource counts are wrong

Hi all,

I have a PBS primary/secondary setup running on Rocky Linux 8.8. Recently the pbs_server.bin process on the primary has begun to die after a few hours, for no reason that I can understand. If I look in /var/log/messages, I see the following:

Jul  9 16:22:57 pbs01 systemd-coredump[81551]: Core file was truncated to 2147483648 bytes.
Jul  9 16:23:17 pbs01 systemd-coredump[81551]: Process 2185 (pbs_server.bin) of user 0 dumped core.#012#012Stack trace of thread 2185:#012#0  0x00007fd0c9ea5acf n/a (n/a)
Jul  9 16:23:17 pbs01 systemd[1]: systemd-coredump@0-81550-0.service: Succeeded
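
A note on reading the log above: the #012 sequences are how rsyslog escapes newline characters (octal 012) by default, so the stack trace is really a multi-line message. Below is a minimal Python sketch to decode them. The function name decode_syslog_escapes is made up for illustration, and the sketch assumes that only control characters (octal 000 to 037) were escaped.

```python
import re


def decode_syslog_escapes(line: str) -> str:
    """Turn rsyslog-style #ooo octal escapes (e.g. #012) back into characters.

    Only control characters (octal 000-037) are matched, so the gdb-style
    frame marker "#0  " in the message is left untouched.
    """
    return re.sub(r"#(0[0-3][0-7])", lambda m: chr(int(m.group(1), 8)), line)


# Message text copied from the /var/log/messages excerpt above.
raw = (
    "Process 2185 (pbs_server.bin) of user 0 dumped core."
    "#012#012Stack trace of thread 2185:#012#0  0x00007fd0c9ea5acf n/a (n/a)"
)
print(decode_syslog_escapes(raw))
```

Decoded this way, the message contains just one stack frame with no symbol information (n/a).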

The crash is always reported at the same place (#012#012Stack ...) with the same address. If I run coredumpctl, I get:

$> coredumpctl debug
...
0  main (argc=1, argv=0x7fffffffd898) at ../../../src/server/pbsd_main.c:632
632		svr_interp_data.data_initialized = 0;
(gdb) where
#0  main (argc=1, argv=0x7fffffffd898) at ../../../src/server/pbsd_main.c:632

The first breakpoint takes me here, but I do not know how to proceed from this point.

In addition, I am seeing another strange behaviour: some nodes report the wrong number of resources, i.e. twice as many CPUs and GPUs as they actually have. Restarting the MoM fixes this. I do not know whether the two problems are related.

All of these problems started a few days ago.

Any help?

Thank you for your time