How can I scatter (i.e. round-robin) multiple jobs over the vnodes?
With our current configuration, when I submit five jobs that each use only a small amount of a shared resource, they are all assigned to a single vnode:
vnode1: job1, job2, job3, job4, job5
I want them to be assigned to different vnodes so as to avoid the slowdown caused by resource contention:
vnode1: job1, job4
vnode2: job2, job5
I know I can scatter multiple chunks within a single job, but I could not find a way to scatter multiple independent jobs.
Any comments or suggestions would greatly be appreciated.
I think node_sort_key can help you. This option can be set in $PBS_HOME/sched_priv/sched_config:
node_sort_key: "<resource> LOW assigned" ALL
Do not forget to kill -HUP the scheduler after saving the file. More info on node_sort_key can be found in the "node_sort_key Syntax" section of the admin guide.
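For example, assuming your jobs mainly contend for CPUs (substitute whatever resource they actually contend for), the sched_config entry could look like this sketch:

```
# $PBS_HOME/sched_priv/sched_config
# Sort vnodes so the one with the fewest *assigned* ncpus comes first,
# which spreads small jobs across vnodes instead of packing them.
node_sort_key: "ncpus LOW assigned" ALL
```

The "assigned" keyword is what makes the sort dynamic: the scheduler re-evaluates it as jobs are placed, so each new job prefers the least-loaded vnode.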
That’s exactly what I wanted!
Thank you for your kind support
I think adding a place line after your select line in the job script:
#PBS -l place=scatter
would also do this, if you want to control specific jobs rather than the global configuration.
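For reference, a minimal job script using this directive (a sketch; the select line and the program name are placeholders):

```
#!/bin/bash
#PBS -l select=2:ncpus=4
#PBS -l place=scatter
# The two chunks above will be placed on different vnodes; note that
# place=scatter controls chunks *within* this one job, not placement
# across independent jobs.
cd "$PBS_O_WORKDIR"
./my_program   # placeholder
```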
Thank you for your reply.
When I tried, e.g., running
$ qsub -lselect=ncpus=1 -lplace=scatter test.sh
three times, those jobs were still placed on a single machine, which was not what I wanted.
Yes, you are right, I made a mistake. place=scatter only affects chunks within a single job.
Thank you for your comment.
":excl" prevents multiple jobs from being assigned to a single node (even if the node has enough resources) when I submit more jobs than there are nodes.
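For example (a sketch; test.sh is a placeholder), exclusive placement can be requested per job at submission time:

```
# Each job gets its node exclusively: no other job shares the node,
# even if resources remain free on it.
qsub -l select=1:ncpus=1 -l place=excl test.sh
```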
Did you get this working? I am moving to OpenHPC with PBS Pro from Rocks Cluster, was working on queues, and have the same question. Currently everything gets piled up on C1 before moving to C2, 3, 4… I would also like to have jobs distributed across nodes evenly.
Please use the below PBS directive for multi-node jobs:
#PBS -l place=scatter
For example, on the command line:
qsub -l select=4:ncpus=4 -l place=scatter -- /bin/application arguments
or in the job script:
#PBS -l select=4:ncpus=4
#PBS -l place=scatter
"qsub: Cannot be used with select or place: nodes"
I think that is a different setting from what I am looking for. I don't want to set it within the script anyway; it should be set globally, controlling how jobs are scheduled across the nodes.
In Maui.cfg, it was set here…
This setting made Maui assign jobs to the nodes with the lowest load and the fewest jobs.
Could you please share the PBS directives used in your script or qsub submission? Also, which version of PBS Pro are you using?
Please check the scheduler configuration file, $PBS_HOME/sched_priv/sched_config (run source /etc/pbs.conf to get $PBS_HOME), and look for node_sort_key.
Please check this document:
https://www.altair.com/pdfs/pbsworks/PBS19.2.3_BigBook.pdf and the below sections of it:
"node_sort_key Syntax"
"Examples of Sorting Vnodes"
After making any updates to the sched_config file, make sure you kill -HUP the scheduler.
Hope this helps.
Does "kill -HUP" do the same thing as "systemctl restart pbs"? Will this kill jobs that are currently running?
Not the same; some configuration changes need systemctl restart pbs.
Please restart when no jobs are running on the system; otherwise, jobs may be killed or requeued.
Please check these sections of this guide: https://www.altair.com/pdfs/pbsworks/PBS19.2.3_BigBook.pdf
Chapter 7 Starting & Stopping PBS
Table 7-2: Commands to Start, Stop, Restart, Status PBS
Table 7-3: MoM Restart Options
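As a cautious sketch (assuming a systemd-managed install, as "systemctl restart pbs" implies), one way to minimize disruption is to quiesce scheduling before a full restart:

```
# Stop the scheduler from starting new jobs; running jobs are unaffected
qmgr -c "set server scheduling = false"

# ...wait for running jobs to drain, or pick a maintenance window...

systemctl restart pbs        # full restart of the PBS daemons

qmgr -c "set server scheduling = true"   # resume scheduling
```

A plain kill -HUP of pbs_sched, by contrast, only makes the scheduler re-read its configuration and does not touch running jobs.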
Thanks for your help, one last question. Can you restart the scheduler without impacting running jobs?
See "Reinitializing the Scheduler on Linux" in the same guide:
ps -ef | grep pbs_sched
kill -HUP <scheduler PID>
PBS Scheduler is stateless. You can kill the scheduler and start it at any point.
You can do this:
qmgr -c "set server scheduling = true" # to start a new scheduling cycle
Still trying to get this to work. I looked at "node_sort_key", but I think that is more along the lines of what you would use to sort nodes if you had a bunch of nodes with varying configurations (ncpus, mem, etc.).
I looked in the PBS scheduler config and saw "smp_cluster_dist", which seems to be exactly what I am trying to do, but it didn't change the outcome.
Is there a required time frame between job submissions? The reason I ask is that the script I am testing with actually runs through 20 or so jobs: it creates the scripts, submits them, and then loops back to run the next job until all are done. So when I submit, it creates 20 jobs in a second or so.
I have made the changes, and still, when I submit jobs, they are being scheduled more like the "pack" method, where one node gets filled up to capacity before jobs spill over to the next node. I changed smp_cluster_dist to "lowest_load", thinking that would help, but still no good.
cat /opt/pbs/etc/pbs_sched_config | grep -v '#' | grep -v -e '^$'
round_robin: False all
by_queue: True prime
by_queue: True non_prime
strict_ordering: false ALL
help_starving_jobs: true ALL
backfill_prime: false ALL
node_sort_key: "ncpus LOW" ALL
sort_queues: true ALL
resources: "ncpus, mem, arch, host, vnode, aoe, eoe"
load_balancing: true ALL
fair_share: true ALL
preemptive_sched: true ALL
preempt_prio: "express_queue, normal_jobs"
Please share your pbsnodes -aSj output and your chunk submission line. What job placement string do you use?
No, there is no required time frame; thousands of jobs are submitted within a minute in some use cases.
smp_cluster_dist is deprecated; please check the PBS Pro Administrator's Guide.
Regarding "one node gets filled up to capacity before jobs spill over to the next node": see section 4.7, "Specifying Job Placement", in the PBS Professional User Guide (https://www.altair.com/pdfs/pbsworks/PBSUserGuide19.2.3.pdf), and use the below node_sort_key:
node_sort_key: "ncpus LOW assigned" ALL
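To see why the "assigned" keyword matters, here is a toy Python simulation (not PBS code; node and job names are made up). A static sort key never changes the node order on identical nodes, so every job lands on the first node; a least-assigned-first key re-sorts after each placement and scatters the jobs:

```python
def place_jobs(num_jobs, nodes, sort_key):
    """Assign each 1-cpu job to the first node after sorting by sort_key."""
    assigned = {n: 0 for n in nodes}  # cpus assigned so far per node
    placement = {}
    for job in range(1, num_jobs + 1):
        # Pick the node that sorts first under the given key (stable sort,
        # so ties keep the original node order, like identical vnodes).
        node = sorted(nodes, key=lambda n: sort_key(n, assigned))[0]
        assigned[node] += 1
        placement[f"job{job}"] = node
    return placement

nodes = ["vnode1", "vnode2", "vnode3"]

# Static key (like sorting identical nodes by total ncpus): packs.
packed = place_jobs(5, nodes, lambda n, a: 0)

# Dynamic key (like "ncpus LOW assigned"): least-loaded node first, scatters.
scattered = place_jobs(5, nodes, lambda n, a: a[n])

print(packed)     # every job on vnode1
print(scattered)  # jobs rotate across vnode1, vnode2, vnode3
```

This is only an illustration of the sorting idea; the real scheduler also weighs resource requests, priorities, and placement directives.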
Still "packs" all jobs onto a node, then moves to the next.
Maybe I have something wrong with my queue settings? I haven't done much on that part yet.
queue_type = Execution
Priority = 50
total_jobs = 15
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:15 Exiting:0 Begun:0
resources_max.mem = 24000mb
resources_max.ncpus = 16
resources_max.nodes = 4
resources_max.walltime = 96:00:00
resources_default.ncpus = 1
resources_default.walltime = 24:00:00
resources_assigned.mem = 360000mb
resources_assigned.mpiprocs = 120
resources_assigned.ncpus = 120
resources_assigned.nodect = 15
hasnodes = True
enabled = True
started = True
[~]# pbsnodes -a
Mom = compute-00.local.cluster
Port = 15002
pbs_version = 19.1.1
ntype = PBS
state = free
pcpus = 72
jobs = 22669.hmrihpcp02/0, 22669.hmrihpcp02/1, 22669.hmrihpcp02/2, 22669.hmrihpcp02/3
resources_available.arch = linux
resources_available.host = compute-00
resources_available.mem = 394618508kb
resources_available.ncpus = 72
resources_available.vnode = compute-00
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 56
resources_assigned.vmem = 0kb
queue = default
resv_enable = True
sharing = default_shared
last_state_change_time = Mon May 18 10:29:21 2020
last_used_time = Mon May 18 10:29:21 2020
Thank you for sharing the details.
Your queue settings are correct.
If you have only one compute node in the PBS cluster, then all the jobs have to be packed onto that node.
If you have more than one compute node, then with node_sort_key: "ncpus LOW assigned" ALL set (and after a kill -HUP <pid of pbs_sched>), you should see jobs being scheduled onto the other nodes based on the ncpus resource allocation.
Please share the output of pbsnodes -aSj and qmgr -c 'p s'