I've been testing Red Hat 9 (we have loads of RHEL 8 systems that are not showing this issue), and it seems that every time the PBS server is rebooted, the next job submitted gets a job number rounded up to the next multiple of 1000.
0,1,2,…502, (reboot) 1000,1001…1300 (reboot) 2000,2001…
The most bizarre database corruption I ever saw, if this is accidental…
Has anyone else seen oddness on Red Hat 9 (9.5)?
Not that it's really a problem, and I could just go in and fix up sv_jobidnumber whenever the PBS server boots, now that I know there is strangeness.
It's just odd.
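For anyone wanting to predict the next ID, here is a minimal sketch that reproduces only the numbers reported above (502 then 1000, 1300 then 2000). The function name and the step of 1000 are assumptions taken from those observations, not from the PBS source.

```shell
# Model of the OBSERVED numbering only (502 -> 1000, 1300 -> 2000).
# Hypothetical helper: this is not the logic used inside the PBS server.
next_id_after_unclean_restart() {
    last_id=$1
    step=1000   # assumed from the jumps reported in this thread
    # Integer division drops the remainder, so this yields the next multiple of step.
    echo $(( (last_id / step + 1) * step ))
}
```

Calling `next_id_after_unclean_restart 502` prints 1000, matching the first jump in the sequence.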
If there is an abrupt shutdown of the PBS server/datastore, the job ID count is incremented by X to avoid job corruption. You might have already checked all of these, but please check whether there is a core dump, a space or quota issue, or anything relevant in /var/log/messages; the Postgres tunables might also help. I have not encountered this issue with your workflow, only with a wrong failover configuration.
I looked at the code (get_next_svr_sequence_id in src/server/req_quejob.c). My guess is that the rounding is an unintended effect of database refactoring done by commit ce0cb14d0 to support a site-settable max jobid.
So do I need to set a max job ID (which I don't think I do), or is this due to the previously suggested bad shutdown, where on restart it just happens to round up a lot instead of going to the next unused job ID?
I couldn't find any core dumps from previous shutdowns, nor any space issues, but I was going to try shutting down PBS more gracefully before the server is rebooted. Just haven't yet.
If I shut down PBS as a service before rebooting, the next job ID looks fine again (no rounding up), so I will make sure I do that going forward.
Maybe RHEL 9 isn't shutting down PBS automagically when going down like RHEL 8 does.
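The workaround described here (stop PBS first, then reboot) can be sketched as a small wrapper. The unit name `pbs` matches the pbs.service mentioned in this thread; the helper names `run` and `safe_reboot` and the `DRY_RUN` switch are hypothetical, added only so the sequence can be checked without actually rebooting.

```shell
# Run a command, or just print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Stop PBS cleanly first; only reboot if the stop succeeded.
safe_reboot() {
    run systemctl stop pbs || return 1
    run systemctl reboot
}
```

Set `DRY_RUN=1` before calling `safe_reboot` to see the two commands in order; run it as root without `DRY_RUN` to perform the real stop and reboot.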
Also, I'm assuming the
set server max_job_sequence_id = 9999999
is the default max job ID, as I don't see that being set anywhere in our build process.
The maximum possible sequence ID is 12 digits: 999,999,999,999. Cluster administrators can limit the ID by setting the server-level attribute 'max_job_sequence_id'.
Please note that the ID is reset to 0 once max_job_sequence_id is reached.
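To check or change the limit, something along these lines should do, assuming qmgr is on the PATH and you have PBS manager privileges. The value is the one quoted earlier in the thread, not a recommendation.

```shell
# Look for the attribute in the server configuration dump.
qmgr -c "print server" | grep max_job_sequence_id

# Set the limit explicitly (value taken from the earlier post).
qmgr -c "set server max_job_sequence_id = 9999999"
```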
The /usr/lib/systemd/system-shutdown/mystop.shutdown approach does not work.
The root cause of this issue is:
The pbs_init.d startup script launches pbs_data_service and PostgreSQL as child processes that escape systemd's control group (cgroup). During shutdown, systemd kills these processes in parallel with the main pbs.service, so the database can be terminated before pbs.service has finished stopping, which leads to an unsafe stop of pbs.service. You can see the shutdown error in the server log at `/var/spool/pbs/server_logs/date_numbers`, for example:
01/04/2026 09:15:56;0001;Server@mu;Svr;Server@mu;PBS server internal error (15011) in svr_save_db, Failed to save server Execution of Prepared statement update_svr failed: FATAL: terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request. 57P01
01/04/2026 09:15:56;0001;Server@mu;Svr;Server@mu;panic_stop_db, Panic shutdown of Server on database error. Please check PBS_HOME file system for no space condition.
01/04/2026 09:15:56;0002;Server@mu;Svr;Log;Log closed
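To check whether earlier reboots hit the same problem, you can search the server logs for the `panic_stop_db` marker shown in the excerpt above. This is a minimal sketch: the function name is hypothetical, and the default directory is the one quoted in this post, so pass a different path if your PBS_HOME differs.

```shell
# Print every server-log line that records a panic shutdown on a database error.
# $1 is the log directory; it defaults to the path quoted in this thread.
find_panic_shutdowns() {
    log_dir="${1:-/var/spool/pbs/server_logs}"
    # -H prefixes each match with its file name, so the date of the log is visible.
    grep -H 'panic_stop_db' "$log_dir"/* 2>/dev/null
}
```

Each hit is printed as `file:line`, so the file name tells you which day's log recorded the unsafe shutdown.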