When I start or restart the PBS service on the server/scheduler host, it attempts to requeue hundreds of thousands of jobs, to the point that the service takes over 30 minutes to start (currently over 500,000 jobs requeued, 25 minutes in). It may be looping through the jobs and hitting each one more than once.
Here is an excerpt from the server_logs:
08/11/2022 07:54:44;0100;Server@cfe1;Job;143996.cfe1;enqueuing into normal, state 9 hop 1
08/11/2022 07:54:44;0086;Server@cfe1;Job;143996.cfe1;Requeueing job, substate: 92 Requeued in queue: normal
08/11/2022 07:54:44;0100;Server@cfe1;Job;143997.cfe1;enqueuing into normal, state 9 hop 1
08/11/2022 07:54:44;0086;Server@cfe1;Job;143997.cfe1;Requeueing job, substate: 92 Requeued in queue: normal
08/11/2022 07:54:44;0100;Server@cfe1;Job;143998.cfe1;enqueuing into normal, state 9 hop 1
08/11/2022 07:54:44;0086;Server@cfe1;Job;143998.cfe1;Requeueing job, substate: 92 Requeued in queue: normal
Any thoughts would be greatly appreciated. These are all old jobs and could be deleted if I knew where they were stored (they do not appear in qstat).
Once the server is up and running, you can unset the job history setting (the job_history_enable server attribute) and set it again. The jobs might be in the history (qstat -xH).
You can also delete these jobs from the PBS datastore (a PostgreSQL database).
Please state the version of PBS Pro you are using. Forum members may be able to share their experience if they have seen similar issues with that version, or if a bug has since been fixed.
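To check whether the server really is hitting each job more than once during startup, it may help to count the "Requeueing job" lines per job ID in the server log. Below is a minimal sketch in Python, assuming the semicolon-separated layout seen in the excerpt above (timestamp;code;server;type;job ID;message). The function name count_requeues and the script name are made up for illustration, and the log file path is passed on the command line rather than assumed.

```python
import sys
from collections import Counter


def count_requeues(lines):
    """Count 'Requeueing job' log lines per job ID."""
    counts = Counter()
    for line in lines:
        # Assumed layout: timestamp;event code;server;object type;object id;message
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) < 6:
            continue  # skip lines that do not match the assumed layout
        obj_type, job_id, message = fields[3], fields[4], fields[5]
        if obj_type == "Job" and message.startswith("Requeueing job"):
            counts[job_id] += 1
    return counts


if __name__ == "__main__":
    # Usage (hypothetical script name): python count_requeues.py <server log file>
    with open(sys.argv[1], errors="replace") as fh:
        counts = count_requeues(fh)
    repeated = {job: n for job, n in counts.items() if n > 1}
    print(f"{len(counts)} distinct jobs requeued, {len(repeated)} more than once")
    # Show the ten most frequently requeued jobs, if any were repeated
    for job, n in Counter(repeated).most_common(10):
        print(job, n)
```

If the second number printed is zero, each job is being requeued only once, and the slow start is more likely down to the sheer number of stale jobs than to a loop.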
I will try to unset and reset the history setting and try restart again (as soon as the server starts again…)
I am not familiar with interacting directly with the PBS datastore (although I am familiar with postgres). Any documentation that you could point me to would be helpful.
I am currently running OpenPBS 20.0.1 on Rocky Linux 8.6.