How to gracefully shut down the PBS server and scheduler while preserving jobs on the PBS MoMs?

PBS server primary
PBS server secondary
PBS MOMs

We need to move the /var/spool/pbs volume from GFS2 to ext4 because of a problem with GFS2 fencing of that volume on both the primary and secondary servers, which handle PBS queue management.

So we will shut down the PBS service with “service pbs stop” on the servers, but we would like the running jobs to continue, since nothing will change on the PBS MoMs. How can we do this?

Hi,
I think you need to run the qterm -f command to stop both servers. See the description of the qterm options in the Reference Guide.

-f
If the complex is configured for failover, shuts down both the primary and secondary servers.
Without the -f option, qterm shuts down the primary server and makes the secondary server active.
The -f option cannot be used with the -i or -F options.

Check the chapter “Starting and Stopping PBS on Linux” in the Installation & Upgrade Guide.

The following command shuts down the primary server, the secondary server, the scheduler(s), and all MoMs in the complex. Running jobs and subjobs continue to run:

qterm -s -m -f

-m
Shuts down the primary server and all MoMs (pbs_mom). This option does not cause jobs or subjobs to be killed. Jobs are left running subject to other options to the qterm command.

-s
Shuts down the primary server and the scheduler (pbs_sched).

Please refer to:

PBS Professional 2020.1 Installation & Upgrade Guide, page IG-164
PBS Professional 2020.1 Reference Guide, page RG-233

@boboshaq @agurban @adarsh Great, we appreciate your prompt help and the quick solution. We are greatly relieved that we will only be stopping the PBS servers and that the jobs on the PBS MoMs will continue. This saves a lot of time for calculations already running on the clusters.

Now that the scheduled downtime is over, we would like to bring the PBS system back up. Will “service pbs start” bring it up gracefully, or do we need to start the PBS daemons manually so that the preserved jobs are not restarted?

@adarsh We do not wish to shut down the PBS MoMs, as the nodes will stay running. If we use qterm -s -m -f, all PBS daemons are stopped and running jobs are unaffected. When we restart the PBS service (PBS server and MoMs), will the jobs be restarted? What is the safe method to restart PBS without affecting running jobs?

I am not sure whether there is an option that leaves the pbs_moms running and shuts down only the primary and secondary servers.

Please refer to the section below and check the ‘-p’ option for starting pbs_mom:
2.23.3 Options to pbs_mom
RG-72 PBS Professional 2021.1 Reference Guide

-p
Specifies that when starting, MoM should allow any running jobs to continue running, and not have them requeued. This option can be used for single-host jobs only; multi-host jobs cannot be preserved. Cannot be used with the -r option. MoM is not the parent of these jobs.

I’m not sure about this, but my recollection is that when we needed to work on the server while jobs were running, we just used “qterm -t quick” on the server host. This left the MoMs and jobs running. When the server was brought back up, it recovered connections to the MoMs. If any jobs had finished while the server was down, the MoMs let the server know.

We didn’t use failover, but it looks as if “qterm -f -t quick” is what to use in that case.

(We also stopped scheduling several minutes ahead of the qterm, just to cut down on activity during the shutdown and after the restart: qmgr -c 'set server scheduling=false')
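The sequence described in this post could be sketched as a small shell function. This is only a sketch under assumptions: it is meant to run as root or a PBS manager on the server host, the `run` helper and the `DRY_RUN` and `QUIET_SECONDS` variables are hypothetical names introduced here, and the 300-second pause is just a placeholder for "several minutes". By default it only prints the commands rather than executing them.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN=1 (the default),
# otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Stop scheduling first, wait a little, then shut down both servers
# (failover case) without touching the MoMs or the running jobs.
shutdown_servers_keep_jobs() {
    run qmgr -c 'set server scheduling=false'
    # Assumption: quiet period before the shutdown; adjust to taste.
    run sleep "${QUIET_SECONDS:-300}"
    run qterm -f -t quick
}
```

Calling `shutdown_servers_keep_jobs` prints the three commands; setting `DRY_RUN=0` first would actually run them, so it is worth checking the printed output before doing that.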

Hi,

qterm -f -m -s kept the jobs running. However, the multiple OS and PBS service restarts (primary and secondary) that we did to test cluster node HA might have caused jobs to halt.

Of course, the steps you all gave helped us understand the PBS shutdown process better.

We quickly used this in our graceful cluster shutdown (PBS server, scheduler, and MoMs), as follows:
qterm -f -s -t immediate

Yes, we will test qterm -f -t quick in time; as the manual suggests, it should work too.

But the real challenge is the order in which to restart PBS. Should we start the PBS MoMs first, and then the primary and secondary PBS servers?

Thank you.

Regards,
Anilkumar

I am not positive, so please double-check my logic:

  • If you want to stop everything cleanly, use qterm -t immediate -f -m -s. This terminates or requeues all running jobs.

  • If you want to keep jobs running, but work on servers, use qterm -t quick -f. This leaves the MoMs and jobs running. When you want to keep jobs running, it is better to keep the MoMs running also. When you are done with your work, just restart the server/scheduler.

  • If you want to stop the MoMs for some reason, but keep jobs running, you need qterm -t quick -m. However, there is no documented way to restart after this and recover the running jobs. When we wanted to do this, we modified the pbs init.d script to add ‘-p’ to the pbs_mom command line. Note this might not work for multi-node jobs.
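The three cases above can be collected into a small lookup, which may help when writing a maintenance runbook. This is a sketch only: the function name `qterm_for_scenario` and the scenario labels are hypothetical, and the function just prints the qterm command from the corresponding bullet rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: print the qterm command suggested above for a scenario.
#   stop-all   stop servers, scheduler, and MoMs; jobs are terminated or requeued
#   keep-jobs  stop both servers; MoMs and jobs keep running
#   stop-moms  stop server and MoMs; jobs keep running (restart needs pbs_mom -p)
qterm_for_scenario() {
    case "$1" in
        stop-all)  echo "qterm -t immediate -f -m -s" ;;
        keep-jobs) echo "qterm -t quick -f" ;;
        stop-moms) echo "qterm -t quick -m" ;;
        *)         echo "unknown scenario: $1" >&2; return 1 ;;
    esac
}
```

For example, `qterm_for_scenario keep-jobs` prints `qterm -t quick -f`.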

In general, I would start the server before starting the MoMs. But, be sure to set node_fail_requeue=0 quickly once the server is responding. This way, if you are slow getting the MoMs back up, the server won’t requeue their jobs.
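A rough sketch of the server-side part of that restart order follows. It assumes the server is started with “service pbs start” as mentioned earlier in the thread and uses `qstat -B` to check that the server responds; the `run` and `wait_for_server` helpers and the `DRY_RUN` and `MAX_TRIES` variables are hypothetical names. By default it only prints the commands.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN=1 (the default),
# otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Poll the server with qstat -B until it answers or MAX_TRIES is reached.
# In dry-run mode the first check succeeds immediately.
wait_for_server() {
    tries=0
    until run qstat -B >/dev/null 2>&1; do
        tries=$((tries + 1))
        if [ "$tries" -ge "${MAX_TRIES:-30}" ]; then
            return 1
        fi
        sleep 2
    done
}

# Start the server first, then disable requeue-on-node-failure as soon as
# it responds, so that slow MoM startup does not requeue jobs.
restart_server_side() {
    run service pbs start
    wait_for_server || { echo "server not responding" >&2; return 1; }
    run qmgr -c 'set server node_fail_requeue=0'
}
```

Once the MoMs are back, you would presumably want to set node_fail_requeue back to whatever value your site used before.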

Depending on how long you expect the server to be down, you might consider suspending all running jobs during the dedicated time and resuming them once the server is back up. This protects the jobs from network or server glitches during the dedicated time. There are some tricks to this and we have local mods to make it go more smoothly.
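A bare-bones version of the suspend/resume idea might look like the sketch below. It assumes `qselect -s R` lists the running job IDs and that `qsig -s suspend` and `qsig -s resume` are permitted for the user running it. The function names and the list file are hypothetical, and it does not include the tricks or local modifications mentioned above. The suspended job IDs are saved to a file so that only those jobs are resumed later.

```shell
#!/bin/sh
# Send the given qsig signal name (suspend or resume) to each job ID
# read from standard input, one ID per line.
signal_jobs() {
    while read -r jobid; do
        if [ -n "$jobid" ]; then
            qsig -s "$1" "$jobid" </dev/null
        fi
    done
}

# Suspend all currently running jobs and record their IDs in file $1.
suspend_running_jobs() {
    qselect -s R | tee "$1" | signal_jobs suspend
}

# Resume exactly the jobs recorded in file $1.
resume_jobs_from_file() {
    signal_jobs resume < "$1"
}
```

Usage would be something like `suspend_running_jobs /tmp/suspended.list` before the maintenance window and `resume_jobs_from_file /tmp/suspended.list` once the server is back up.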

@dtalcott Thanks for useful information.

Please see section 8.5.2 in the Installation & Upgrade Guide, on page IG-169, for instructions on stopping and restarting PBS while preserving running jobs.