How to restrict a maximum of 2 jobs per user in each scheduling cycle

Let's say a user submits 5 jobs back-to-back and my PBS scheduling cycle is 30 sec.
Now all 5 jobs are in the Q state. How do I restrict things so that only the top 2 queued jobs from that user, in priority order, are evaluated for sending to the Run state (assuming resources are available for all 5 queued jobs in the next scheduling cycle)?

Have you looked at the max_run parameter? You could set a limit of 2 jobs running per user with

qmgr -c 'set server max_run="[u:PBS_GENERIC=2]"'

If that is not what you want, what is the higher-level goal?

Hi dtalcott,

Thanks for the quick response.

However, our requirement is a bit different.

We have configured server_dyn_res: "VCPU_AVAIL !/opt/pbs/sbin/free_vcpus.sh" to limit the number of cloud instances. The scheduler evaluates the VCPU_AVAIL dynamic resource in every scheduling cycle (i.e., every 30 sec) and, instead of examining jobs one by one, sends every job that passes the VCPU evaluation to the Run state, which breaches the VCPU quota limit. The following test case shows how to reproduce the issue.
Here VCPU = 2 x NCPU (considering HT is on).
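For reference, the dynamic resource is wired into $PBS_HOME/sched_priv/sched_config along these lines (the resources: line is abbreviated here):

resources: "ncpus, mem, ..., VCPU_AVAIL"
server_dyn_res: "VCPU_AVAIL !/opt/pbs/sbin/free_vcpus.sh"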

Step-1: Edited /opt/pbs/sbin/free_vcpus.sh so that the available free VCPUs are capped at a maximum of 74.

$ /opt/pbs/sbin/free_vcpus.sh
74

Step-2: Submit a 48-core (i.e., 96 VCPU) job. It should wait in Q, as expected.

$ qstat -aws|grep -A1 parallel
3079.ip-10-77-162-79 testuser1 parallel CFX -- 48 48 -- 05:00 Q --
Can Never Run: Insufficient amount of server resource: VCPU_AVAIL (R: 96 A: 76 T: 76)

Step-3: Then submit 2 consecutive 24-core (i.e., 48 VCPU) jobs. I expect one to go to the Running state and the other to wait, but both go to the Running state to initiate instance creation, thereby over-consuming the VCPU limit.

$ qstat -aws|grep -A1 parallel
3079.ip-10-77-162-79 testuser1 parallel CFX -- 48 48 -- 05:00 Q --
Can Never Run: Insufficient amount of server resource: VCPU_AVAIL (R: 96 A: 76 T: 76)
3080.ip-10-77-162-79 testuser1 parallel CFX -- 24 24 -- 15:00 R --
3081.ip-10-77-162-79 testuser1 parallel CFX -- 24 24 -- 15:00 R --

2nd Requirement: Although I have set backfill_depth = 2 for the queue "parallel", I expect the scheduler to first reserve resources for the top 2 jobs, in order, before sending lower-priority jobs to the Run state, which is not happening in this case (i.e., block all other lower-priority jobs unless they do not affect the execution of the top 2 jobs). Please note the walltime set is 5 hours for the first job and 15 hours for the second and third.

$ qmgr -c "print queue parallel "|grep depth
set queue parallel backfill_depth = 2

Let me know if we can tune some of the parameters to meet this requirement.

+1 to what @dtalcott has suggested; there are other limits which can be applied.

The below might help:

You can use job_sort_formula_threshold to ignore the jobs whose formula value falls below the threshold.
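A minimal sketch of that (the formula itself is only an illustration, and if I remember correctly the threshold lives on the sched object):

$ qmgr -c 'set server job_sort_formula = "job_priority"'
$ qmgr -c 'set sched job_sort_formula_threshold = 1'

Jobs whose formula value comes out below the threshold are left queued for that cycle.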

What about the 2nd Requirement? backfill_depth = 2 is not being honored at all.

backfill_depth sets the number of top jobs that are calendared; other jobs can then be filled in around them, as long as they do not affect the start times of these calendared jobs. The scheduler does not reserve resources for the backfilled jobs themselves, because the situation may vary: in the next scheduling cycle, based on the other jobs in the queue and fresh submissions, other jobs can become candidates for backfilling. The landscape may change for the backfillable jobs, but the calendared jobs remain intact.

backfill_depth is used to let smaller jobs run in the gaps between large jobs without affecting the start times of the larger jobs. Without calendaring, smaller jobs would keep jumping ahead, pushing the large jobs ever further into the future, and the large jobs would never get a chance to run.

As per my understanding, for backfilling to work properly (i.e., to make smaller jobs run between large jobs without affecting the start times of the larger jobs), the scheduler should be able to estimate the approximate start time of each top job, which can be viewed using the "qstat -T" command.

In my case the "Est Start Time" column is blank, and the reason could be that the scheduler has no way of knowing when the server_dyn_res "VCPU_AVAIL !/opt/pbs/sbin/free_vcpus.sh" resource will become available.

So, it is better to disable backfilling by setting backfill_depth = 0 and let the jobs run in priority order without backfilling.
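For this queue that would be:

$ qmgr -c "set queue parallel backfill_depth = 0"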

Please correct me if I am wrong…

I think the problem with backfill is that it looks at only consumable or node-level resources; that is, things that can be reserved.

As adarsh suggested, you might try using a static server level consumable resource whose resources_available value is adjusted by a periodic hook to match the number of VCPUs defined by the current cloud resources.
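A minimal sketch of that idea, written here as a cron-driven script rather than a hook (a server periodic hook would do the equivalent from Python). The fvcpus resource name matches the qmgr commands further down this thread; total_vcpus.sh is a hypothetical script that reports the cloud's total VCPU capacity. Note that it publishes total capacity rather than free capacity, because PBS already subtracts what running jobs hold via resources_assigned.fvcpus:

#!/bin/bash
# sync_fvcpus.sh (hypothetical), run from cron every minute
total=$(/opt/pbs/sbin/total_vcpus.sh)   # hypothetical: total cloud VCPU capacity
/opt/pbs/bin/qmgr -c "set server resources_available.fvcpus = ${total}"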

There's no need for that if you want a limit per scheduling cycle. Just use an uncounted (no "n" or "q" flag) server_dyn_res custom resource and let its script return 2. The scheduler will consume up to 2 of it each cycle and no more.
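A minimal sketch of that (the resource name percycle is made up here):

$ qmgr -c "create resource percycle type=long"   # no n or q flag, so it is uncounted

Then in $PBS_HOME/sched_priv/sched_config (HUP the scheduler afterwards):

resources: "ncpus, mem, ..., percycle"
server_dyn_res: "percycle !/bin/echo 2"

and have each job request one unit, e.g. qsub -l percycle=1 ...; the scheduler will then start at most 2 such jobs per cycle.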

Obviously, the fact that it is "per cycle" means that in practice, for many other use cases, this is the wrong kind of resource: it is easy to get back-to-back scheduling cycles in which a job started in the previous cycle has, for example, not yet consumed a license, leading to overcommitment.

One more thing. You say “My PBS scheduling cycle is 30 sec”. The value of scheduler_iteration sets the longest interval the server will allow between scheduling cycles. If a new job is queued, the server kicks the scheduler to run right away. Similarly when a job ends.
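That longest interval is the server's scheduler_iteration attribute (in seconds), e.g.:

$ qmgr -c "set server scheduler_iteration = 30"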

I think this explains your observation where two 48 VCPU jobs started when you expected only one. When the first job was accepted, the server immediately kicked the scheduler. The scheduler saw only the one job and started it (48 < 74). Meanwhile, the second job arrived. The server immediately triggered the scheduler again. free_vcpus.sh again reported 74. That is greater than 48, so the scheduler started the second job as well.

You could verify this guess by looking at the sched_logs to see whether the two jobs were started in the same scheduling cycle or in two back-to-back cycles.
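For example (the log file name is the date; the exact log wording varies with the PBS version):

$ grep -E "Starting Scheduling Cycle|3080|3081" $PBS_HOME/sched_logs/<YYYYMMDD>

If a "Starting Scheduling Cycle" line falls between the two job starts, they ran in separate cycles.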

Yes, your explanation is correct, but could you please suggest how to overcome this challenge?

Instead of using server_dyn_res, try a static server-level resource. You can try the below and see whether it helps:

* qmgr -c "create resource  fvcpus type=long,flag=q"
* qmgr -c "set server resources_available.fvcpus=74" # statically set to 74 vcpus 
* Add  fvcpus to the resources: line of the $PBS_HOME/sched_priv/sched_config file ( kill -HUP PID of the pbs_sched>
* qsub -l select=1:ncpus=48  -l fvcpus=48  -- /bin/sleep 100  #  check qstat -Bf output now
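With the numbers above, the qstat -Bf check should then show something like this (illustrative output):

$ qstat -Bf | grep fvcpus
    resources_available.fvcpus = 74
    resources_assigned.fvcpus = 48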