I am looking for a way to run OS update jobs in our environment. Before I state my question, let me explain what I am trying to achieve:
Every 4 weeks we bring our Linux machines (CentOS) to the latest software state. For this we basically run a yum update and a reboot, with some scripts around it for extra checks and settings. The workflow so far is:
-> Close the queue, wait for the nodes to be empty, execute the update on the node, reboot, open the queue.
This was fine until now, but the environment has grown and the nodes are more heavily utilized, so I would like to optimize and automate this into a fire-and-forget process.
There are several ways to tackle this. One would be an Ansible-style playbook, but Ansible knows nothing about node and job state, so that logic would have to be scripted. I was looking for something more elegant. I studied the PBS Big Book and came across the provisioning section.
So my idea is the following: define two resource states, e.g. patched and unpatched. When there is a new OS update, set all nodes to “unpatched” and run a provisioning script that asks for a node in state “patched”. The major advantage of this approach, from my point of view, is that the scheduler takes care of running the update, so there is no need to close the queue or wait for nodes to be empty. Once the provisioning job is submitted, even if the nodes are busy, it is simply postponed until nodes become free, and the environment gets updated. (The updates are tested beforehand on a subset of nodes to avoid problems.)
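For reference, the patched/unpatched state could be modeled as a custom host-level boolean resource. This is only a sketch of one way to do it; the resource name “patched” and the node names are placeholders, not anything from PBS itself:

```shell
# Create a host-level boolean resource (the name "patched" is an assumption)
qmgr -c 'create resource patched type=boolean, flag=h'

# The scheduler must also know about it: add "patched" to the "resources:"
# line in $PBS_HOME/sched_priv/sched_config and HUP the scheduler.

# When a new update is rolled out, mark every node unpatched
for n in $(pbsnodes -av | awk '/^[^ ]/ {print $1}'); do
    qmgr -c "set node $n resources_available.patched = false"
done

# An update job can then target nodes that still need patching, e.g.:
# qsub -l select=1:patched=false update_job.sh
```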
Now to my question: so far I can set the resource states and run a provision. However, I need a mechanism that allows me to run (in the simplest case) a yum update on the compute node as part of the script, and this needs to be done as root. I looked at PBS hooks, but I am struggling to understand how to implement such a hook.
Apart from my question, I am happy to hear how you run updates in your environments; maybe there are better solutions.
Thanks a lot …
A couple things to consider…
Take a look at this discussion: Allowing to schedule node maintenance with a possibility to run new jobs until the maintenance begins
It led to this design document: https://pbspro.atlassian.net/wiki/spaces/PD/pages/879493121/Node+maintenance+window+enhancement
And ultimately led to the introduction of maintenance reservations, thanks to @vchlum.
In terms of running yum as root, I suggest updating your sudo configuration so that one of your user accounts may run “yum update” on your nodes. The update itself may be run from within a job. Since the node will reboot, you won’t want the job to be rerunnable.
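A minimal sudoers entry for this might look like the following. This is a sketch: the account name “foo” is a placeholder, and the yum/shutdown paths may differ on your distribution.

```shell
# /etc/sudoers.d/pbs-update  (edit with "visudo -f /etc/sudoers.d/pbs-update";
# "foo" is a placeholder account name)
foo ALL=(root) NOPASSWD: /usr/bin/yum update -y, /usr/sbin/shutdown -r now
```

Restricting the rule to these exact commands (rather than ALL) limits the damage if the account is compromised.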
@sak there is a site that does pretty much this. They use diskless nodes, so they create a new MoM node image. When they are ready to update, they set default_chunk.aoe=new_aoe. New jobs pick up this new AOE. When such a job runs, it triggers the provisioning script, which sets the node to use the new image and reboots it. When the node comes back up, it is ready.
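For context, the AOE-based flow described above would be configured roughly like this. This is a sketch only; the image names and the node name are placeholders:

```shell
# Advertise the images (AOEs) each vnode can be provisioned with
qmgr -c 'set node node01 resources_available.aoe = "current_img,new_img"'

# Steer new jobs onto the new image via the server's default chunk;
# jobs landing on a node still running current_img trigger provisioning
qmgr -c 'set server default_chunk.aoe = new_img'
```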
Your case is a bit more difficult. They get away with running a command as root on a remote host because they create their new images ahead of time. You’ll need to be able to run a command as root on a remote host from the server machine. The only way I know how to do this is through passwordless ssh (or even less secure methods). This probably isn’t as secure as you want it to be.
Might you consider moving to some sort of diskless model?
@bhroam & @mkaro thanks a lot for your quick reply.
@bhroam thanks for your reply. Yes, we (the admins) would like a diskless scenario, but the customer is not convinced ;-). So this is not an option.
@mkaro thanks a lot for pointing me to this discussion. If I understand pbs_rsub correctly, I need to plan ahead based on an assumption of how long the longest job will run. This is not optimal, since I would prefer a solution where a busy cluster receives an update job and updates itself once free nodes become available. Anyway, this is a nice option and surely fits my needs for now. Thank you.
You wrote “The update itself may be run from within a job”. What would such a job look like?
First, add yourself as a manager in qmgr if you are not already. Assuming your account is foo…
qmgr -c 'set server managers += foo@*'
Then use pbs_rsub to submit your maintenance reservation…
pbs_rsub --hosts exechost002 -R1300 -D20
That will create a queue prefixed by the letter “M” to which you may submit jobs. Configure sudo so that your account may run “yum update -y” and “shutdown -r now” as root. Then submit a job that runs “yum update -y” and reboots the node. Pass the “-r n” parameter to qsub to specify that the job is not rerunnable; you can also specify this in a #PBS directive in the job script itself.
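A minimal job script along those lines might look like this. It is a sketch: it assumes sudo has already been configured to allow these exact commands without a password, and the job name and yum/shutdown paths are placeholders:

```shell
#!/bin/bash
#PBS -N os-update
#PBS -r n                 # not rerunnable: the node reboots mid-job

# Run the update; requires a sudo rule permitting this exact command
sudo /usr/bin/yum update -y

# Reboot the node; the job ends here and must not be requeued
sudo /usr/sbin/shutdown -r now
```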
Obviously, you should pick a couple nodes and test this out before you try it across your entire cluster. But that provides the basic steps required.
great thanks, I’ll try this out. That should do it …