PBS is configured with by_queue enabled and round_robin disabled, so users' jobs are queued FIFO. We want a fairshare policy on the queue so that resources are shared, but somehow the fairshare configuration is not working for us. Any suggestions?
Steps:
Enabled the settings below in the configuration file /opt/pbs/etc/pbs_sched_config.
It looks like you modified the wrong file. You need to change the sched_config file in pbs home. The file in pbs exec is provided as a reference copy of the default sched_config file. Edit /var/spool/pbs/sched_priv/sched_config and make the same changes you made to the file in pbs exec.
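For reference, the fairshare-related lines in sched_config typically look something like this (the decay time and shares are illustrative; adjust for your site):

fair_share: true ALL
fairshare_usage_res: cput
fairshare_entity: euser
fairshare_decay_time: 24:00:00
unknown_shares: 10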
Now, we did change the correct file, sched_config. The changes seem to have taken effect; however, how do we test whether the fairshare policy is working correctly?
Submit X jobs as User1, let them run and finish.
Turn scheduling off: qmgr -c "set server scheduling = false"
Submit another X jobs as User1.
Submit Y jobs as User2.
Turn scheduling back on: qmgr -c "set server scheduling = true"
Check the output of the pbsfs command and qstat -answ1 (see the sketch below).
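Putting those steps together, a rough sketch (assuming simple sleep jobs and that you can submit as both users; adjust the counts and sleep times to taste):

# as User1
for i in 1 2 3 4 5; do qsub -- /bin/sleep 600; done
# ... let these run and finish, then pause scheduling
qmgr -c "set server scheduling = false"
# as User1 again
for i in 1 2 3 4 5; do qsub -- /bin/sleep 600; done
# as User2
for i in 1 2 3 4 5; do qsub -- /bin/sleep 600; done
# resume scheduling and inspect the result
qmgr -c "set server scheduling = true"
pbsfs
qstat -answ1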
You have two choices now. You can populate your resource_group file with all your users (and possibly subdivide them into fairshare groups), or you can set fairshare_enforce_no_shares to false. By default, the scheduler will not run any jobs whose entities are not in the resource_group file.
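For reference, entries in the resource_group file take the form "name  unique-id  parent-group  shares", something like the following (names, IDs, and shares are purely illustrative):

user1    60    root    10
user2    61    root    10
user3    62    root    20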
As per option 1, we created the resource_group file.
What we found is that users' jobs in the queue are not sorted by cput utilization; it is still FIFO only. How do we fix it?
Pasting a few lines from the command output:
39826 sigbin ambreesh.khurana 709:57:1 R small
39827 sigbin ambreesh.khurana 678:39:4 R small
39828 sigbin ambreesh.khurana 622:23:1 R small
39829 sigbin ambreesh.khurana 678:43:5 R small
39830 sigbin ambreesh.khurana 571:26:3 R small
39831 sigbin ambreesh.khurana 603:58:3 R long
39832 sigbin ambreesh.khurana 579:41:1 R long
39833 sigbin ambreesh.khurana 563:55:1 R long
We performed the testing as per the suggested method and found that the queue is still working in FIFO order.
User1 submitted many jobs to the queue. A few of User1's jobs were queued and the rest were running. After some time, User2 submitted a few jobs, and all of User2's jobs stayed queued.
As User1's jobs completed, only User1's jobs executed, one after another. Only after all of User1's jobs finished did User2's jobs start running.
The expected order was: let the running User1 jobs complete, but the next jobs to run should have been User2's, since his jobs were waiting.
When fairshare is enabled, PBS Pro considers each entity's usage of the system; the next entity to get a job run is the one that has used the cluster's resources the least compared to the others. For example:
user1 = usage is 75
user2 = usage is 40
user3 = usage is 30
user3 will get a chance first, then user2, then user1.
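You can inspect the current usage values with pbsfs, for example:

pbsfs            # dump the whole fairshare tree, including shares and usage
pbsfs -g user2   # show the fairshare information for a single entity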
First question: did you HUP or restart the scheduler after setting up fairshare? The config file is not reread until you do.
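If you haven't, something like this should do it (assuming a single, default scheduler):

kill -HUP $(pgrep pbs_sched)    # makes the scheduler re-read sched_config
# or restart PBS entirely, e.g. with systemctl restart pbs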
The other thing is you shouldn't be seeing usages of 0 unless you specifically set them. The scheduler will set the usage to 1 by default. Did you run pbsfs -s on the entities to set them to 0?
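To be clear, that would have been a command of this form (entity name is illustrative):

pbsfs -s user1 0    # explicitly sets user1's fairshare usage to 0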
Hey @amolthute
Sorry for the delayed response. I was on vacation.
For the usage being 0, I'm just surprised. The default value for any entity is 1. You can see this with the unknown group: its usage is 1. The usage of the two users you have is 0. Usually the only way that can happen is if someone specifically sets it with pbsfs. Can you check if /var/spool/pbs/sched_priv/usage.bak is there? That means someone has run pbsfs -s before. Maybe another admin ran it?
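Something like this will tell you (assuming the default PBS_HOME):

ls -l /var/spool/pbs/sched_priv/usage*    # usage.bak showing up here means pbsfs has rewritten the usage database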
Your setup does look like it is properly set up. When you start the scheduler, does the log report any errors when parsing the config file?
Just a random question (this has come up before): are you modifying /var/spool/pbs/sched_priv/sched_config and not /opt/pbs/etc/pbs_sched_config? The latter is just a reference copy of the default sched_config.
Are you in a multi-sched environment? If so, the value of sched_priv might not be properly set, so the sched_config file you are modifying is not being used.
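If you are on a version with multi-sched support, you can check which sched_priv each scheduler object is using with something like:

qmgr -c "list sched"    # look at the sched_priv attribute of each scheduler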
Now we're grasping at straws here. Everything looks like it is set up properly.
The only two things which can perturb the fairshare order are preemption and starving jobs. Are you using preemption? Are any of your out-of-order jobs starving? Do you have a job_sort_formula?
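A few quick ways to check (assuming the default PBS_HOME):

qmgr -c "print server" | grep job_sort_formula
grep -E "^(preemptive_sched|help_starving_jobs)" /var/spool/pbs/sched_priv/sched_config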
Let's check whether it is actually set up properly and you are simply expecting different behavior. If that is the case, we can modify the configuration to fit your expectations.
Do the following:
turn scheduling off: qmgr -c "s s scheduling=f"
submit your jobs
turn scheduling back on: qmgr -c "s s scheduling=t"
Look at the logs for the cycle which was just run. The fairshare order is the order of the lines containing "Considering job to run". Is that order what you expect?
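For example, on the scheduler host (assuming the default PBS_HOME; the log file is named for today's date):

grep "Considering job to run" /var/spool/pbs/sched_logs/$(date +%Y%m%d)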
The order of the scheduler's sort is the following:
1. Is a job in a reservation?
2. The preemption priority of a job (e.g., express queue jobs vs. normal jobs)
3. Has a job been preempted?
4. Is a job starving?
5. Is there a job_sort_formula?
6. Fairshare
If a job fits into 1-5, it will run out of fairshare order.