Redhat 9 and PBS server reboot causing "next job id" to increase

steveheistand · April 15, 2025, 2:37pm

I've been testing Red Hat 9 (we have loads of RHEL 8 systems that are not showing this issue), and it seems that every time the PBS server is rebooted, the next job submitted gets a job number rounded up to the next multiple of 1000:
0,1,2,…502, (reboot) 1000,1001…1300 (reboot) 2000,2001…
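
The pattern above can be sketched as a small shell function. This is only a model of the observed numbers, not the PBS server code; the function name `next_id_after_unclean_restart` and the step of 1000 are assumptions taken from the sequence in this post.

```shell
# Hypothetical model of the observed jump; not taken from the PBS source.
# $1 = last job ID used before the reboot, $2 = step (assumed to be 1000).
next_id_after_unclean_restart() {
  last_id=$1
  step=${2:-1000}
  # Integer division drops the remainder, then we move to the next multiple.
  echo $(( (last_id / step + 1) * step ))
}

next_id_after_unclean_restart 502    # prints 1000
next_id_after_unclean_restart 1300   # prints 2000
```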

The most bizarre database corruption I ever saw, if this is accidental…

Has anyone else seen oddness on Red Hat 9 (9.5)?

Not that it's really a problem; I can just go in and fix up sv_jobidnumber whenever the PBS server boots, now that I know about the strangeness.
It's just odd.

thanks
s

adarsh · April 15, 2025, 6:11pm

If there is an abrupt shutdown of the PBS server or datastore, the job ID count is incremented by X to avoid job corruption. You might have already checked all of these, but please check whether there is a core dump, a space or quota issue, or anything related in /var/log/messages; the Postgres tunables might also help. I have not encountered this issue with your workflow, only with a wrong failover configuration.

dtalcott · April 15, 2025, 7:34pm

I looked at the code (get_next_svr_sequence_id in src/server/req_quejob.c). My guess is that the rounding is an unintended effect of database refactoring done by commit ce0cb14d0 to support a site-settable max jobid.

steveheistand · April 15, 2025, 7:50pm

So do I need to set a max jobid (which I don't think I do), or is this due to the previously suggested bad shutdown, where it just happens to round up a lot when it comes back instead of going to the next unused jobid?

thanks
s

steveheistand · April 15, 2025, 7:51pm

I couldn't find any core dumps from previous shutdowns, nor any space issues, but I was going to try shutting down PBS more gracefully before the server is rebooted. I just haven't yet.

thanks
s

steveheistand · April 15, 2025, 9:15pm

If I shut down PBS as a service before rebooting, the next jobid looks fine again (no rounding up), so I will make sure I do that going forward.
Maybe RHEL 9 isn't shutting down PBS automagically when going down, like RHEL 8 does.
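
A minimal sketch of doing the two steps in order, assuming the unit is named pbs.service (check the name on your system). The function name `pbs_safe_reboot` and the `DRY_RUN` variable are made up for illustration; when `DRY_RUN` is set, the commands are only printed and nothing is stopped or rebooted.

```shell
# Hypothetical wrapper: stop PBS cleanly, then reboot the host.
# Assumes a systemd unit called pbs.service; needs root when run for real.
pbs_safe_reboot() {
  # With DRY_RUN set to a non-empty value, prefix each command with echo.
  run=${DRY_RUN:+echo}
  # Do not reboot if PBS could not be stopped cleanly.
  $run systemctl stop pbs.service || return 1
  $run systemctl reboot
}
```

Running `DRY_RUN=1 pbs_safe_reboot` first shows the commands without executing them.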

Also, I'm assuming that
set server max_job_sequence_id = 9999999
is the default max job id, as I don't see that being set anywhere in our build process.

thanks
s

adarsh · April 15, 2025, 11:14pm

Thank you for the above information.

The maximum possible sequence ID is 12 digits (999,999,999,999); cluster administrators can limit the ID by setting the server-level attribute 'max_job_sequence_id'.

Please note that the sequence is reset back to 0 once max_job_sequence_id is reached.
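
The wrap-around described above can be sketched as follows. This is an illustration only, not PBS code; the helper name `next_seq_id` is hypothetical, and the fallback limit of 9999999 is the value assumed to be the default earlier in this thread.

```shell
# Hypothetical helper illustrating the wrap-around; not taken from PBS.
# $1 = current sequence ID, $2 = max_job_sequence_id (9999999 assumed if omitted).
next_seq_id() {
  cur=$1
  max=${2:-9999999}
  if [ "$cur" -ge "$max" ]; then
    # The limit has been reached, so the sequence starts again at 0.
    echo 0
  else
    echo $(( cur + 1 ))
  fi
}

next_seq_id 41        # prints 42
next_seq_id 9999999   # prints 0
next_seq_id 999 999   # prints 0
```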

berlin2123 · May 28, 2025, 8:39am

It's solved by adding a script that runs automatically before shutdown.

Add an executable file /usr/lib/systemd/system-shutdown/mystop.shutdown:

#!/usr/bin/sh
# Stop the PBS service and give it time to finish
# before the shutdown proceeds.
/usr/bin/systemctl stop pbs.service
/usr/bin/sleep 10
