Is there a non-PDF version of the documentation? It’s quite hard to search for answers to common problems when the first page of search results is a list of very long PDFs.
Even within a single PDF, searching can be difficult because of its length.
For example: how do I make a reservation global?
I might be wrong, but I don’t think there’s a non-PDF version of the documentation for learning about the various attributes and features in PBS. I think an ‘FAQ’ page would be useful to maintain; we can try that out.
Anyways, coming to your question, I’m not sure what you mean by global, do you want a reservation that reserves all nodes in the cluster?
The main issue with PDFs is that their search isn’t smart. DuckDuckGo and Google do smart search. When I search for a term, I get a relevancy rating and ordering. PDFs just give me a long list of matches with no indication of which is most relevant.
Re global reservation - am I the only one that’s experiencing this?? - we need to stop all jobs on the cluster so we can upgrade from 19.1.1 to 19.1.3. The 19.1.2 upgrade notes explicitly state:
** Performing an overlay upgrade between 19.1.1 and 19.1.2 will result in the pbs_mom losing track of running job processes, causing the jobs to appear to run forever to PBS Pro. You must not upgrade from 19.1.1 to 19.1.2 while any PBS jobs are running.
While it also states
Upgrades from 19.1.2 onward will not be impacted by this
It doesn’t explicitly say “upgrades from x<19.1.2 to x>=19.1.3 can ignore this warning”, so I am planning a maintenance shutdown of the whole system.
There are several ways to accomplish what you want. I think the most direct way is to just set all your nodes to state=offline. This will let all running jobs complete and PBS will not start any new jobs.
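As a rough illustration of the offline approach, here is a minimal Python sketch that only builds the command line and does not run anything. It assumes the standard `pbsnodes -o` (mark offline) and `pbsnodes -r` (clear offline) options; the node names and the helper function are made up for the example.

```python
def pbsnodes_command(nodes, offline=True):
    """Build the argv for marking nodes offline (-o) or clearing that state (-r)."""
    if not nodes:
        raise ValueError("no nodes given")
    flag = "-o" if offline else "-r"
    return ["pbsnodes", flag, *nodes]


# Hypothetical node names, for illustration only.
nodes = ["node01", "node02", "node03"]

# Before the maintenance: drain the cluster.
print(" ".join(pbsnodes_command(nodes)))

# After the upgrade: bring the nodes back.
print(" ".join(pbsnodes_command(nodes, offline=False)))
```

The real node list would come from your own site; the sketch leaves running the command to you so nothing is changed by accident.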
The “global” reservation option you are thinking of would require you to craft a select statement that requests all nodes on your system and use pbs_rsub to submit the reservation. This will only be accepted if all jobs have finished before the reservation starts. PBS will only allow new jobs to start if they will finish before the reservation. This is somewhat tricky since if you have any nodes down now, you can’t include them in your reservation. We have since fixed this issue with something called a maintenance reservation that can be confirmed on any node regardless of state. This feature is not in 19.1.
The last option is to use dedicated time. You choose a time in the future after all jobs are finished and set dedicated time. Jobs will run if they do not pass over dedicated time. For this to work, jobs require a walltime requested.
I would suggest just offlining all your nodes. It’s the simplest and most direct way to do what you want. It’s a little more wasteful, since as the system drains you can’t use the CPU cycles of nodes that free up early to run short jobs.
Interesting - thank you. Turns out we only have one queue, called Submission. Apparently it’s a routing queue. From there, all jobs are processed to have a walltime (1 hour or other) and cpu count (1 or other) and are distributed into a set of execution queues depending on department.
Which makes me think that dedicated time is the best way forward. The offline idea is so simple I’m a little embarrassed I didn’t think of it. Unfortunately for us, we have some queues with 700-hour and 200-hour walltimes. Taking all machines offline 700 hours out will make us very unpopular with the 1-24 hour job mob.
Dedicated time doesn’t actually require a dedicated queue. Having a dedicated queue just allows you to run jobs while in dedicated time.
For dedicated time to work, you will have to determine when the longest currently running job will end. You can schedule dedicated time then. The scheduler will not stop you from scheduling dedicated time while jobs are still running. If that happens, dedicated time will start with running jobs. You will still have to wait for them to finish to perform your upgrade.
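To make "when the longest currently running job will end" concrete, here is a small Python sketch. It assumes you have already collected each running job's start time and requested walltime (for example from `qstat -f` output); the job data below is invented and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def earliest_safe_start(running_jobs):
    """Return the time at which the last running job is due to finish.

    running_jobs: iterable of (start_time, walltime) pairs.
    Raises ValueError if there are no running jobs.
    """
    return max(start + walltime for start, walltime in running_jobs)


# Made-up jobs: one 700-hour job and one 24-hour job.
jobs = [
    (datetime(2019, 10, 28, 9, 0), timedelta(hours=700)),
    (datetime(2019, 11, 20, 12, 0), timedelta(hours=24)),
]

# Dedicated time should not begin before this moment.
print(earliest_safe_start(jobs))
```

This only looks at jobs that are already running. Jobs that start later are handled by the scheduler, which will not start them if they would cross into dedicated time.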
Let’s forget about dedicated queues - I understand their function and have no need for them. In my use case I want to make the entire cluster accessible until the last minute, but inaccessible after that.
I have this entry in dedicated_time
# Maintenance window for upgrades
11/28/2019 09:00 11/29/2019 09:00
I am hoping this means that no jobs will run in that 24-hour window, and all jobs that would run into this window will remain queued until after the maintenance is over.
@datakid you are correct. Jobs will run if they won’t cross into dedicated time. Jobs will not run during dedicated time. They will start running again after dedicated time.
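The rule above can be sketched in a few lines of Python. This is an illustration of the behaviour described in this thread, not the scheduler's actual code: it handles a single window, ignores dedicated queues, and assumes a job that ends exactly when the window starts is allowed. The function names are hypothetical; the window line is the one from the dedicated_time file quoted earlier.

```python
from datetime import datetime, timedelta

FMT = "%m/%d/%Y %H:%M"


def parse_window(line):
    """Parse 'MM/DD/YYYY HH:MM MM/DD/YYYY HH:MM' into (start, end)."""
    parts = line.split()
    start = datetime.strptime(" ".join(parts[0:2]), FMT)
    end = datetime.strptime(" ".join(parts[2:4]), FMT)
    return start, end


def can_start(now, walltime, window):
    """True if a job started at `now` would not overlap the window."""
    start, end = window
    if now >= end:
        return True   # dedicated time is over
    if now >= start:
        return False  # we are inside dedicated time
    # Before the window: the job must finish by the time it begins
    # (ending exactly at the boundary is an assumption of this sketch).
    return now + walltime <= start


window = parse_window("11/28/2019 09:00 11/29/2019 09:00")

# A 1-hour job the day before fits; a 48-hour job would cross the window.
print(can_start(datetime(2019, 11, 27, 9, 0), timedelta(hours=1), window))
print(can_start(datetime(2019, 11, 27, 9, 0), timedelta(hours=48), window))
```

Note that the check depends on the job having a walltime, which matches the earlier point that jobs need a requested walltime for dedicated time to work.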
As a suggestion, take a look at shrink to fit jobs and see if they fit your site. They will maximize the jobs that will run up against a dedicated time boundary. They aren’t for everyone though.
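For anyone unfamiliar with the idea, here is a simplified Python sketch of how a shrink-to-fit job could be sized against a dedicated time boundary. It assumes the job gives a minimum and a maximum acceptable walltime (in PBS Pro these are, as far as I recall, the `min_walltime` and `max_walltime` resources) and that the scheduler picks the largest value that still fits. The function is hypothetical and leaves out everything else the scheduler considers.

```python
from datetime import timedelta


def shrink_to_fit(available, min_walltime, max_walltime):
    """Pick a walltime for a shrink-to-fit job, or None if it cannot run.

    available: time left before the next boundary (e.g. dedicated time).
    """
    if available < min_walltime:
        return None  # even the shortest acceptable run would cross the boundary
    return min(available, max_walltime)


# Six hours remain before the maintenance window.
left = timedelta(hours=6)

# A job happy with anything from 2 to 24 hours is shrunk to 6 hours.
print(shrink_to_fit(left, timedelta(hours=2), timedelta(hours=24)))

# A job that needs at least 8 hours stays queued.
print(shrink_to_fit(left, timedelta(hours=8), timedelta(hours=24)))
```

This only suits jobs that can make use of whatever time they are given, for example ones that checkpoint, which is why the approach isn't for everyone.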