We want to be able to flag when a file system goes down and avoid running jobs that need that file system. The plan was to use a server boolean custom resource. It would be true most of the time and we would set it to false when the file system is down. Here is what I tried to do:
#
# Create and define resource home_fs
#
create resource home_fs
set resource home_fs type = boolean
set resource home_fs flag = h
qmgr -c "set server resources_available.home_fs=true"
I edited /var/spool/pbs/sched_priv/sched_config and added home_fs to the resources line and then HUPed the scheduler.
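For reference, that edit just appends the new resource name to the scheduler's resources line; a sketch, where the built-in entries shown are illustrative and vary by site:
resources: "ncpus, mem, arch, host, vnode, home_fs"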
I then ran:
qsub -l home_fs=true -l walltime=05:00 -- /usr/bin/hostname
qstat -f showed:
Resource_List.select = 1:home_fs=True:ncpus=1
Submit_arguments = -l home_fs=true -l walltime=05:00 -- /usr/bin/hostname
comment = Can Never Run: Insufficient amount of resource: home_fs (True !=False)
Some questions:
Why is home_fs listed in Resource_List.select? I did not do a select, so that should have been a job-wide resource, should it not? I assume that is why I am getting the True != False: it isn't defined on the vnode. We don't want to have to define it on every node, as changing it would then be extremely inconvenient.
The examples in the manual for custom resources, both static and dynamic, were consumable (counts of licenses) and either PBS keeps track of the count (static) or you need a script (dynamic). Can I use a boolean at all? If so, can I just use qmgr to change the value? If not, do I have to treat this as a dynamic resource and write a script that returns true or false?
Thanks for the suggestion. Some additional questions:
Your suggestion made me look at Table 5-9, Resource Accumulation Flags, on page AG-261. In this case, do I want to set the flag to q, or do I want no flag at all? This is a boolean resource. The description of q says the resource is going to be incremented by one and must be consumable or time-based. No flag means the resource is not consumable, which seems to be a better match?
If I want no flag, how do I clear it? Will qmgr -c "set resource home_fs flag = ''" work?
I tried to change the flag to q and this is what I got:
(base) [allcock@edtb-01 20220216-16:57:04]> qmgr -c "set resource home_fs flag = q"
qmgr obj=home_fs svr=default: Resource busy on job
There is no job running or queued, and I restarted PBS on the thought that maybe the resource was 'stuck'. Same result. Any idea how I 'unstick' the resource?
qmgr -c "unset resource home_fs " # if this did not work
a. source /etc/pbs.conf ; qterm -t quick
b. update or delete the resource line with respect to home_fs in the $PBS_HOME/server_priv/resourcedef
c. $PBS_EXEC/sbin/pbs_server
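For reference, a resourcedef entry is a single line per resource; a hypothetical line for this resource might look like the following (format from memory, so verify against the existing file before editing):
home_fs type=boolean flag=h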
I think it is a good idea to use a node health check script (a mom periodic hook) to periodically check the file system on the nodes and, if there is a node health issue (file system not accessible, full, or mount not available), set the node offline.
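As a sketch of that last step, a health check script could offline the node with a comment, something like this (node name and comment are illustrative):
pbsnodes -o -C "home_fs mount not available" edtb-01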
The reason we don't want to go that route is because we have multiple file systems. If one is down, we want to avoid running jobs that need that one, but can run jobs that don't.
Thanks for your help with this. I did get the resource 'unstuck', though I would love to understand why that happened in the first place. However, it is not working as I had hoped. I thought it was: the value was True, I ran a job, and it worked. I set the value to False and it didn't. But now I have set it back to True and the queued jobs didn't run, and neither are newly submitted jobs running. I will continue to poke at this.
I believe you are right, you want no flags on the resource. The reason it showed up in the select is because of flag=h and the fact that you didn't submit a select. When you don't submit a select, the server will create one for you based on all of the flag=h resources you have requested. If you do submit a select, you can't submit any flag=h resources as job-wide resources.
The reason you got the 'resource is busy on job' message is that you can change very little about resource definitions while they are requested by a job in the system (or even in history). This means a submitted/history job, not just a running one.
You were right when you said that the job didn't run because the nodes didn't have it set. A node is considered to have 0 of a resource, or will not match (as is the case here), if it isn't set.
What Adarsh said to do by shutting down the server and changing the resourcedef file will work. It also can have some undesired effects because the server/scheduler isn't going to expect things like a no-flag resource in a select.
Now onto the current problem at hand. Requesting a boolean resource at the job-wide level should work fine. If the job doesn't run because of it, you'll see a comment like "Insufficient resources at server level XXX (True != False)". If you don't see "server level" or don't see the "(True != False)", then there is some other reason the job is not running. What is the comment on the jobs that aren't running?
Once we have the current issue ironed out, something you could consider doing is writing a server_dyn_res script for each file system that does a health check on it. This means you won't have to manually set the file system resource to True/False; the server_dyn_res script will do it itself.
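A minimal sketch of that idea, assuming the mount point is /home and the script lives at /usr/local/sbin/check_home_fs.sh (both hypothetical). The scheduler runs the script each cycle and reads its stdout as the value of the resource:
#!/bin/bash
# Hypothetical health check for the file system behind home_fs.
# Prints True if the mount answers within 10 seconds, False otherwise.
MOUNTPOINT=/home    # assumption: adjust to the real mount point
if timeout 10 stat -f "$MOUNTPOINT" > /dev/null 2>&1; then
    echo True
else
    echo False
fi
Then, in sched_config, the static server value is replaced by a dynamic one (the resource is no longer set through qmgr):
server_dyn_res: "home_fs !/usr/local/sbin/check_home_fs.sh"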
FWIW, where I used to work, we had a similar requirement, but for scratch file systems. This was with a much older version of PBS, and my unreliable memory is that I first tried using server boolean resources, but could not get the behavior we wanted. I switched to consumable resources, where a queuejob hook looked up the default scratch file system for the user and added a "-l scratchX=1" to the job. (If the job already specified some scratchX=Y values, the hook did nothing.) To enable use of file system scratchX, we set the server resources_available.scratchX to a large number (5000). To block starting jobs that requested the file system, we set resources_available.scratchX=0.
This had a few minor advantages over a simple boolean. First, some jobs did not need scratch space, so they could specify scratchX=0 and the job would not be blocked no matter what state scratchX was in. (Specifying a boolean scratchX=false for a job blocks the job until scratchX is down.) Second, the resources_assigned.scratchX values told you how many jobs were (probably) using a given file system. Third, you could set resources_available.scratchX to a small number as a coarse limit on the load to allow on the file system. Fourth, you could create a reservation to start at a specific time that requested all 5000 of the scratchX resources, thus scheduling a dedtime for just that file system. (I'm not sure we ever used this; it needs testing.)
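A sketch of that setup for one hypothetical file system resource, scratch1; the queuejob hook that injects the default scratchX=1 is omitted:
# Define a consumable server-level scratch resource (flag=q per Table 5-9)
qmgr -c "create resource scratch1 type=long,flag=q"
# (also add scratch1 to the sched_config resources line and HUP the scheduler)
# File system healthy: allow plenty of concurrent requests
qmgr -c "set server resources_available.scratch1 = 5000"
# File system down: block new jobs that request it
qmgr -c "set server resources_available.scratch1 = 0"
# A job that needs the file system (normally the hook adds this)
qsub -l scratch1=1 -- /usr/bin/hostname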
Interesting idea. My first thought was "Why would anyone set it to False?", but then it occurred to me they might be thinking they were explicitly saying "I don't need this file system" which, as you say, would not work as intended.
The count of how many jobs are using the file system might come in handy, though I don't recall us ever needing to know that. I doubt we would ever use it as a throttle. The reservation idea might be useful for benchmarking, but I could also see a user setting it to 5000 on their own and blocking other users when we didn't want them to. If we wrote a hook, I guess we could overwrite any value greater than zero to be one, unless the submitter was a manager or something along those lines. I would also have to think about how that figures into the prioritization calculation.
It is sort of working, but not really in a practical way. I thought adding an explicit select, so that it would consider home_fs as a job-wide resource, would get this to work, but no such luck:
(base) [allcock@edtb-01 20220222-22:39:49]> qsub -l home_fs=true -l walltime=05:00 -l select=ncpus=8 -- /usr/bin/hostname
2943.edtb-01.mcp.alcf.anl.gov
(base) [allcock@edtb-01 20220222-22:40:34]> qstat
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
2943.edtb-01 STDIN allcock 0 Q workq
(base) [allcock@edtb-01 20220222-22:40:36]> qstat -f 2943
Job Id: 2943.edtb-01.mcp.alcf.anl.gov
Job_Name = STDIN
Job_Owner = allcock@edtb-01.mcp.alcf.anl.gov
job_state = Q
queue = workq
server = edtb-01.mcp.alcf.anl.gov
Checkpoint = u
ctime = Tue Feb 22 22:40:34 2022
Error_Path = edtb-01.mcp.alcf.anl.gov:/home/allcock/STDIN.e2943
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Tue Feb 22 22:40:34 2022
Output_Path = edtb-01.mcp.alcf.anl.gov:/home/allcock/STDIN.o2943
Priority = 0
qtime = Tue Feb 22 22:40:34 2022
Rerunable = True
Resource_List.home_fs = True
Resource_List.ncpus = 8
Resource_List.nodect = 1
Resource_List.place = free
Resource_List.preempt_targets = Queue=preemptable
Resource_List.select = ncpus=8
Resource_List.walltime = 00:05:00
schedselect = 1:ncpus=8
substate = 10
Variable_List = PBS_O_HOME=/home/allcock,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=allcock,
PBS_O_PATH=/opt/miniconda3/bin/:/home/allcock/miniconda3/bin:/home/all
cock/miniconda3/condabin:/opt/miniconda3/bin:/home/allcock/bin:/usr/loc
al/bin:/usr/bin:/bin:/opt/pbs/bin:/home/allcock/bin,
PBS_O_MAIL=/var/spool/mail/allcock,PBS_O_SHELL=/bin/bash,
PBS_O_WORKDIR=/home/allcock,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=workq,
PBS_O_HOST=edtb-01.mcp.alcf.anl.gov
euser = allcock
egroup = users
queue_rank = 1645569634105414
queue_type = E
comment = Can Never Run: Insufficient amount of queue resource: home_fs (Tr
ue != False)
etime = Tue Feb 22 22:40:34 2022
Submit_arguments = -l home_fs=true -l walltime=05:00 -l select=ncpus=8 -- /
usr/bin/hostname
executable = <jsdl-hpcpa:Executable>/usr/bin/hostname</jsdl-hpcpa:Executabl
e>
project = _pbs_project_default
Submit_Host = edtb-01.mcp.alcf.anl.gov
server_instance_id = edtb-01.mcp.alcf.anl.gov:15001
(base) [allcock@edtb-01 20220222-22:40:50]> qmgr -c "list home_fs"
qmgr: Illegal object type: home_fs.
(base) [allcock@edtb-01 20220222-22:42:29]> qmgr -c "list resource home_fs"
Resource home_fs
type = boolean
(base) [allcock@edtb-01 20220222-22:42:38]> qmgr -c "print server" | grep home_fs
# Create and define resource home_fs
create resource home_fs
set resource home_fs type = boolean
set server resources_available.home_fs = True
I assume the reason it is saying True != False is because it is looking for the resource on the node, rather than the server?
And then things got interesting. I decided maybe putting the flag on and taking it off was the problem, so I created a new resource called test_fs:
Then I started testing (2943 is left from before and is using home_fs rather than test_fs). Here is what I did:
Set test_fs to False.
Submitted job 2946 requiring test_fs=True; it did not start, with the comment showing test_fs (True != False).
I set test_fs=True and checked to see if it started; it did not.
I submitted job 2947, also requiring test_fs=True, in an attempt to force a scheduling cycle; none of the jobs started, all saying True != False.
Then I restarted the PBS server to see if that would make things work. It did not help immediately, but I came back 30 minutes later and all the jobs had run, including the one depending on home_fs.
Thoughts?
(base) [allcock@edtb-01 20220222-22:55:54]> qstat
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
2943.edtb-01 STDIN allcock 0 Q workq
(base) [allcock@edtb-01 20220222-22:56:02]> qmgr -c "print server" | grep test_fs
# Create and define resource test_fs
create resource test_fs
set resource test_fs type = boolean
set server resources_available.test_fs = False
(base) [allcock@edtb-01 20220222-22:56:41]> qsub -l test_fs=true -l walltime=05:00 -l select=ncpus=8 -- /usr/bin/hostname
2946.edtb-01.mcp.alcf.anl.gov
(base) [allcock@edtb-01 20220222-22:56:51]> qstat
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
2943.edtb-01 STDIN allcock 0 Q workq
2946.edtb-01 STDIN allcock 0 Q workq
(base) [allcock@edtb-01 20220222-22:56:54]> qstat -f 2946 | grep comment
comment = Can Never Run: Insufficient amount of queue resource: test_fs (Tr
(base) [allcock@edtb-01 20220222-22:57:15]> qmgr -c "set server resources_available.test_fs=True"
(base) [allcock@edtb-01 20220222-22:57:35]> qstat
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
2943.edtb-01 STDIN allcock 0 Q workq
2946.edtb-01 STDIN allcock 0 Q workq
(base) [allcock@edtb-01 20220222-22:57:39]> qsub -l test_fs=true -l walltime=05:00 -l select=ncpus=8 -- /usr/bin/hostname
2947.edtb-01.mcp.alcf.anl.gov
(base) [allcock@edtb-01 20220222-22:57:55]> qstat
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
2943.edtb-01 STDIN allcock 0 Q workq
2946.edtb-01 STDIN allcock 0 Q workq
2947.edtb-01 STDIN allcock 0 Q workq
(base) [allcock@edtb-01 20220222-22:57:58]> qstat -f 2947 | grep comment
comment = Can Never Run: Insufficient amount of queue resource: test_fs (Tr
(base) [allcock@edtb-01 20220222-22:58:11]> sudo systemctl restart pbs
(base) [allcock@edtb-01 20220222-22:58:41]> qstat
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
2943.edtb-01 STDIN allcock 0 Q workq
2946.edtb-01 STDIN allcock 0 Q workq
2947.edtb-01 STDIN allcock 0 Q workq
(base) [allcock@edtb-01 20220222-22:58:44]> stat
stat: missing operand
Try 'stat --help' for more information.
(base) [allcock@edtb-01 20220222-22:59:25]> qstat
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
2943.edtb-01 STDIN allcock 0 Q workq
2946.edtb-01 STDIN allcock 0 Q workq
2947.edtb-01 STDIN allcock 0 Q workq
(base) [allcock@edtb-01 20220222-22:59:30]> qstat
(base) [allcock@edtb-01 20220222-23:26:18]> qstat -f 2947 | grep comment
qstat: 2947.edtb-01.mcp.alcf.anl.gov Job has finished, use -x or -H to obtain historical job information
(base) [allcock@edtb-01 20220222-23:26:51]> qstat -xf 2947 | grep comment
comment = Job run at Tue Feb 22 at 23:08 on (edtb-01[0]:ncpus=8) and finish
(base) [allcock@edtb-01 20220222-23:26:57]> qstat -xf 2946 | grep comment
comment = Job run at Tue Feb 22 at 23:08 on (edtb-01[0]:ncpus=8) and finish
(base) [allcock@edtb-01 20220222-23:27:13]> qstat -xf 2943 | grep comment
comment = Job run at Tue Feb 22 at 23:08 on (edtb-01[0]:ncpus=8) and finish
You found a bug. Nice catch. For server and queue level resources, if a value is unset, the resource should be ignored. This is the case for all resource types other than booleans; it should be true for booleans as well, but it isn't.
The PBS scheduler is getting a C++ facelift. In the past couple of years we compiled our C code with a C++ compiler. Ever since then, we've been updating the scheduler to use C++ constructs. Recently, the resource comparison code was refactored. That's where this bug slipped in.
Please make the following change in the function find_check_resource() (check.cpp):
Existing code:
if (resreq->type.is_boolean)
res = fres;
Change it to:
/* Only fall back to comparing the unset boolean (as False) when the
 * caller asked for unset resources to be treated as zero; otherwise
 * the unset resource is ignored. */
if (resreq->type.is_boolean && (flags & UNSET_RES_ZERO))
	res = fres;
This should fix your problem.
Either that, or use a consumable resource like @dtalcott suggested. It doesn't run into the bug.
In any case, I'll file the bug and see about getting it fixed.
Actually, another option is to just set the resource at the queue level as well. That's probably unmanageable, though, since you want one boolean per file server.
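For completeness, setting the boolean at the queue level is one qmgr command per queue (queue name taken from the transcript above), but the value would then have to be kept in sync on every queue for every file system:
qmgr -c "set queue workq resources_available.home_fs = True"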
I don't understand this part. Restarting the server should have no effect on the bug in question; the behavior should always be the same. The scheduler will think that an unset queue resource is False, so if the job didn't run once, it should never run (as the comment said).