We want to be able to flag when a file system goes down and avoid running jobs that need that file system. The plan was to use a server boolean custom resource. It would be true most of the time and we would set it to false when the file system is down. Here is what I tried to do:
# Create and define resource home_fs
create resource home_fs
set resource home_fs type = boolean
set resource home_fs flag = h
qmgr -c "set server resources_available.home_fs=true"
I edited /var/spool/pbs/sched_priv/sched_config and added home_fs to the resources line and then HUPed the scheduler.
I then ran:
qsub -l home_fs=true -l walltime=05:00 -- /usr/bin/hostname
qstat -f showed:
Resource_List.select = 1:home_fs=True:ncpus=1
Submit_arguments = -l home_fs=true -l walltime=05:00 -- /usr/bin/hostname
comment = Can Never Run: Insufficient amount of resource: home_fs (True !=False)
Why is home_fs listed in Resource_List.select? I did not request a select, so it should have been a job-wide resource, should it not? I assume that is why I am getting the True != False: the resource isn't defined on the vnode. We don't want to have to define it on every node, since changing it would then be extremely inconvenient.
The examples in the manual for custom resources, both static and dynamic, were consumable (counts of licenses) and either PBS keeps track of the count (static) or you need a script (dynamic). Can I use a boolean at all? If so, can I just use qmgr to change the value? If not, do I have to treat this as a dynamic resource and write a script that returns true or false?
Thanks for the suggestion. Some additional questions:
Your suggestion made me look at table 5-9, Resource Accumulation Flags, on page AG-261. In this case, do I want to set the flag to q, or do I want no flag at all? This is a boolean resource. The description of q says the resource is going to be incremented by one and must be consumable or time-based. No flag means the resource is not consumable, which seems to be a better match?
If I want no flag, how do I clear it? Will qmgr -c "set resource home_fs flag = '' " work?
I tried to make the change to a q and this is what I got:
(base) [allcock@edtb-01 20220216-16:57:04]> qmgr -c "set resource home_fs flag = q"
qmgr obj=home_fs svr=default: Resource busy on job
There is no job running or queued and I restarted PBS on the thought that maybe the resource was “stuck”. Same result. Any idea how I “unstick” the resource?
I think it is a good idea to use a node health check script (mom periodic hook) to periodically check the file system on the nodes; if there is a node health issue (file system not accessible, full, or mount not available), then set the node offline.
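A minimal sketch of such a mom periodic (exechost_periodic) hook, assuming the standard PBS hook API. The /home path and the mount_ok() helper are illustrative, and the pbs module only exists when running inside a PBS hook:

```python
# Sketch of an exechost_periodic hook that offlines the vnode when a
# file system check fails. The /home path and mount_ok() are
# illustrative; the pbs module only exists inside a PBS hook.
import os

def mount_ok(path):
    # The mount is considered healthy if it is a writable directory.
    return os.path.isdir(path) and os.access(path, os.W_OK)

try:
    import pbs
    e = pbs.event()
    if not mount_ok("/home"):
        for name in e.vnode_list.keys():
            e.vnode_list[name].state = pbs.ND_OFFLINE
            e.vnode_list[name].comment = "offlined: /home not accessible"
    e.accept()
except ImportError:
    pass  # outside PBS (e.g., when testing mount_ok standalone)
```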
Thanks for your help with this. I did get the resource “unstuck”, though I would love to understand why that happened in the first place. However, it is not working as I had hoped, even though I thought it was. With the value set to True, I ran a job and it worked. I set the value to False and the job didn’t run, as expected. But now that I have set it back to True, the queued jobs still don’t run, and neither do newly submitted jobs. I will continue to poke at this.
I believe you are right, you want no flags on the resource. The reason it showed up in the select is because of flag=h and the fact that you didn’t submit a select. When you don’t submit a select, the server will create one for you based on all of the flag=h resources you have requested. If you do submit a select, you can’t submit any flag=h resources as job wide resources.
The reason you got the ‘Resource busy on job’ message is that you can change very little about a resource definition while it is requested by a job in the system (or even in history). This means a queued or history job, not just a running one.
You were right when you said that the job didn’t run because the nodes didn’t have it set. If a resource isn’t set on a node, the node is considered to have 0 of it, or, for a boolean (as is the case here), it will not match.
What Adarsh said to do by shutting down the server and changing the resourcedef file will work. It also can have some undesired effects because the server/scheduler isn’t going to expect things like a no-flag resource in a select.
Now onto the current problem at hand. Requesting a boolean resource at the job-wide level should work fine. If the job doesn’t run because of it, you’ll see a comment like “Insufficient resources at server level XXX (True != False)”. If you don’t see “server level” or don’t see the “(True != False)”, then there is some other reason the job is not running. What is the comment on the jobs that aren’t running?
Once we have the current issue ironed out, something you could consider doing is writing a server_dyn_res script for each file system that does a health check on it. This means you won’t have to manually set the file system resource to True/False; the server_dyn_res script will do it itself.
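A sketch of what such a server_dyn_res script might look like; the script reports the resource value by printing it on stdout (here True or False). The /home default and the writability check are illustrative:

```python
#!/usr/bin/env python3
# server_dyn_res sketch: print True if the file system looks usable,
# False otherwise. The default path and the checks are illustrative.
import os
import sys

def fs_usable(path):
    # "Usable" here means: it is a directory and we can write to it.
    return os.path.isdir(path) and os.access(path, os.W_OK)

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "/home"
    print("True" if fs_usable(path) else "False")
```

If I recall the sched_config syntax correctly, the script is then wired in with a server_dyn_res line naming the resource and the command to run.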
FWIW, where I used to work, we had a similar requirement, but for scratch file systems. This was with a much older version of PBS, and my unreliable memory is that I first tried using server boolean resources but could not get the behavior we wanted. I switched to consumable resources, where a queuejob hook looked up the default scratch file system for the user and added a ‘-l scratchX=1’ to the job. (If the job already specified some scratchX=Y values, the hook did nothing.) To enable use of file system scratchX, we set the server resources_available.scratchX to a large number (5000). To block starting jobs that requested the file system, we set resources_available.scratchX=0.
This had a few minor advantages over a simple boolean. First, some jobs did not need scratch space, so they could specify scratchX=0 and the job would not be blocked no matter what state scratchX was in. (Specifying a boolean scratchX=false for a job blocks the job until scratchX is down.) Second, the resources_assigned.scratchX values told you how many jobs were (probably) using a given file system. Third, you could set resources_available.scratchX to a small number as a coarse limit on the load to allow on the file system. Fourth, you could create a reservation to start at a specific time that requested all 5000 of the scratchX resources, thus scheduling a dedtime for just that file system. (I’m not sure we ever used this–needs testing.)
Interesting idea. My first thought was “Why would anyone set it to False”, but then it occurred to me they might be thinking they were explicitly saying “I don’t need this file system” which, as you say, would not work as intended.
The how many jobs are using the filesystem might come in handy, though I don’t recall us ever needing to know that. I doubt we would ever use it as a throttle. The reservation idea might be useful for benchmarking, but I could also see a user setting it to 5000 on their own and blocking other users when we didn’t want them to. If we wrote a hook I guess we could overwrite any value greater than zero to be one unless it was a manager or something along those lines. I would also have to think about how that figures into the prioritization calculation.
You found a bug. Nice catch. For server- and queue-level resources, if a value is unset, the resource should be ignored. This is the case for all resource types other than booleans; it should hold for booleans as well.
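The intended comparison semantics can be illustrated standalone (a sketch in plain Python, not the scheduler's actual C++; None models "resource not set at this level"):

```python
# Intended semantics for a requested boolean against a server/queue
# value that may be unset: unset should be ignored (never block the
# job); a set value must agree with the request. A sketch only.

def bool_res_matches(avail, requested):
    if avail is None:
        return True            # unset: ignore, never block the job
    return avail == requested  # set: values must agree
```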
The PBS scheduler is getting a C++ facelift. In the past couple of years we compiled our C code with a C++ compiler. Ever since then, we’ve been updating the scheduler to use C++ constructs. Recently, the resource comparison code was refactored. That’s where this bug slipped in.
Please make the following change in the function find_check_resource() (check.cpp):
res = fres;
Change it to:
if (resreq->type.is_boolean && (flags & UNSET_RES_ZERO))
    res = fres;
This should fix your problem.
Either that or use a consumable resource like @dtalcott suggested. It doesn’t run into the bug.
In any case, I’ll file the bug and see about getting it fixed.
Actually, another option is to just set the resource at the queue level as well. That’s probably unmanageable, though, since you want one boolean per file server.
I don’t understand this part. Restarting the server should have no effect on the bug in question; the behavior should always be the same. The scheduler will think that an unset queue resource is false. If the job didn’t run once, it should never run (as the comment said).