I want to prevent jobs from running on a system so that some maintenance can take place. The queues on the system have 24-hour walltime limits.
From looking at the docs I think I can either use a reservation, or a dedicated time. I think a dedicated time is the way to go.
So if I want the system down for a full day, and there is a queue limit of 24 hours, do I need to have a two day dedicated time, in order to drain the jobs?
So, for example, it's now 9:19 on the 7th of Dec. The following would have the running jobs drained by tomorrow at 10:00, and no jobs would be able to run, unless in a dedicated queue, until 00:00 on the 9th:
No jobs can start if they would conflict with a dedicated time. Already running jobs ignore dedicated times.
So, in your example, you would create the dedicated time entry to exactly match when you want your dedicated jobs to start/run. Then, so long as you create the entry at least 24 hours ahead of time, the non-dedicated jobs will have idled out by the start of dedicated time.
When the dedicated time arrives, the PBS server will start the ded queues automatically. When the ded time ends, PBS stops the queues. (I don’t remember whether the end time is a hard stop for ded jobs, or if ded jobs that have already started can continue to run.)
Jobs that were submitted earlier, before enabling the dedicated_time, keep running during the dedicated time period; these jobs are not killed. But any job submitted after enabling dedicated_time (kill -HUP pid_of_pbs_sched) that would cross the dedicated time boundary is kept in the queued state.
Example:
Job ID                         Username        Queue           Jobname           SessID  NDS   TSK Memory  Time S Time
------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----
4341.pbsserver                 pbsdata         workq           STDIN              43173    1     1     --    -- R 00:01
   pbsserver/0
   Job run at Wed Dec 08 at 11:20 on (pbsserver:ncpus=1)
4342.pbsserver                 pbsdata         workq           STDIN                 --    1     1     --    -- Q    --
   --
   Not Running: Job would cross dedicated time boundary
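The rule described above (a non-dedicated job is held if its walltime would carry it into the dedicated window) can be sketched in a few lines of Python. This is only an illustration of the reasoning, not how the PBS scheduler is implemented. The function name can_start, the year 2021, and the window of 10:00 on 8 Dec to 00:00 on 9 Dec are assumptions taken from the question, and treating a job that ends exactly at the window start as allowed is also an assumption.

```python
from datetime import datetime, timedelta

def can_start(now, walltime, ded_start, ded_end):
    """True if a non-dedicated job started at `now` stays clear of the window."""
    job_end = now + walltime
    # The job must finish before the window opens, or start after it closes.
    return job_end <= ded_start or now >= ded_end

# Hypothetical window based on the question: 10:00 on 8 Dec to 00:00 on 9 Dec.
ded_start = datetime(2021, 12, 8, 10, 0)
ded_end = datetime(2021, 12, 9, 0, 0)
max_walltime = timedelta(hours=24)

# At 09:19 on 7 Dec a full 24-hour job still fits before the window.
print(can_start(datetime(2021, 12, 7, 9, 19), max_walltime, ded_start, ded_end))
# Just after 10:00 on 7 Dec the same job would cross the boundary.
print(can_start(datetime(2021, 12, 7, 10, 1), max_walltime, ded_start, ded_end))
# A one-hour job submitted at that point still fits.
print(can_start(datetime(2021, 12, 7, 10, 1), timedelta(hours=1), ded_start, ded_end))
```

With a 24-hour walltime cap, anything that was allowed to start finishes by the time the window opens, which is why the entry only needs to exist 24 hours ahead and the window itself does not need to be two days long.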
I have tried to set up a queue to use this dedicated time window and run some tests, but I get the following error when submitting a job:
qsub: Unauthorized request
Looking at the logs I get:
12/10/2021 11:32:31;0006;Server@sdb;Hook;Server@sdb;gw_job_setup as queuejob for ba-corourke@nid00004 for new job: started
12/10/2021 11:32:31;0006;Server@sdb;Hook;Server@sdb;set _job.sandbox = private
12/10/2021 11:32:31;0006;Server@sdb;Hook;Server@sdb;set _job.Account_Name = 'GW02'
12/10/2021 11:32:31;0006;Server@sdb;Hook;Server@sdb;set _job.project = GW02
12/10/2021 11:32:31;0006;Server@sdb;Hook;Server@sdb;set job.Resource_List[group_proportion] = -1.0
12/10/2021 11:32:31;0006;Server@sdb;Hook;Server@sdb;set _job.umask = 22L
12/10/2021 11:32:31;0006;Server@sdb;Hook;Server@sdb;gw_job_setup as queuejob for ba-corourke@nid00004 for new job: completed after 0.0012s
12/10/2021 11:32:31;0006;Server@sdb;Hook;Server@sdb;job_default_coretype as queuejob for ba-corourke@nid00004 for new job: started
12/10/2021 11:32:31;0006;Server@sdb;Hook;Server@sdb;cluster is isambard
12/10/2021 11:32:31;0006;Server@sdb;Hook;Server@sdb;original select = 1
12/10/2021 11:32:31;0006;Server@sdb;Hook;Server@sdb;updated select = 1:coretype=arm
12/10/2021 11:32:31;0006;Server@sdb;Hook;Server@sdb;original select = 1:coretype=arm
12/10/2021 11:32:31;0006;Server@sdb;Hook;Server@sdb;updated select = 1:ncpus=64:coretype=arm:mpiprocs=64
12/10/2021 11:32:31;0006;Server@sdb;Hook;Server@sdb;job_default_coretype as queuejob for ba-corourke@nid00004 for new job: completed after 0.0009s
12/10/2021 11:32:31;0006;Server@sdb;Hook;Server@sdb;extra_walltime_limits as queuejob for ba-corourke@nid00004 for new job: started
12/10/2021 11:32:31;0006;Server@sdb;Hook;Server@sdb;soft limits not defined for ded-test queue
12/10/2021 11:32:31;0006;Server@sdb;Hook;Server@sdb;no soft limit defined for job
12/10/2021 11:32:31;0006;Server@sdb;Hook;Server@sdb;extra_walltime_limits as queuejob for ba-corourke@nid00004 for new job: completed after 0.0031s
12/10/2021 11:32:31;0006;Server@sdb;Hook;Server@sdb;set _job.project = GW02
12/10/2021 11:32:31;0080;Server@sdb;Req;req_reject;Reject reply code=15007, aux=0, type=1, from ba-corourke@nid00004
Does anyone know how to fix this problem?
I have enabled acl_groups and acl_users and added myself to both.
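In case it helps anyone digging through a long server log for the same thing, here is a rough Python sketch that pulls out only the rejection lines. It assumes the semicolon-separated layout seen in the log excerpt above. The field names and the functions parse_line and find_rejects are my own labels, not part of PBS.

```python
import re

# Labels for the six semicolon-separated fields seen in the log excerpt.
FIELDS = ("timestamp", "event", "server", "obj_type", "obj_name", "message")
REJECT_RE = re.compile(r"Reject reply code=(\d+), aux=(\d+), type=(\d+), from (\S+)")

def parse_line(line):
    """Split one server log line into named fields, or return None if malformed."""
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))

def find_rejects(lines):
    """Collect timestamp, reply code and requester for every rejected request."""
    rejects = []
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        match = REJECT_RE.search(rec["message"])
        if match:
            rejects.append({
                "timestamp": rec["timestamp"],
                "code": int(match.group(1)),
                "requester": match.group(4),
            })
    return rejects

sample = [
    "12/10/2021 11:32:31;0006;Server@sdb;Hook;Server@sdb;set _job.project = GW02",
    "12/10/2021 11:32:31;0080;Server@sdb;Req;req_reject;Reject reply code=15007, "
    "aux=0, type=1, from ba-corourke@nid00004",
]
print(find_rejects(sample))
```

For the excerpt above this reports a single rejection with code 15007, which lines up with the "Unauthorized request" message from qsub (I believe this is PBSE_PERM in the PBS error codes, but check the reference guide for your version).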