How to scatter jobs over vnodes?

Hi folks,

How can I scatter (i.e. round-robin) multiple jobs over the vnodes?

With our current configuration, when I submit 5 jobs that each use only a small amount of a shared resource, they are all assigned to the same vnode:

  • vnode1: job1, job2, job3, job4, job5
  • vnode2: (vacant)
  • vnode3: (vacant)

I want them to be assigned to different vnodes to avoid the slowdown caused by resource contention:

  • vnode1: job1, job4
  • vnode2: job2, job5
  • vnode3: job3

I know I can scatter multiple chunks within a single job, but I could not find a way to scatter multiple independent jobs.
Any comments or suggestions would be greatly appreciated.
Thank you,

Hi @Ikki,

I think node_sort_key can help you. This option can be set in $PBS_HOME/sched_priv/sched_config:

node_sort_key: "<resource> LOW assigned" ALL

Do not forget to kill -HUP the scheduler after saving the file. More info on node_sort_key can be found in the admin guide: “4.8.50.1 node_sort_key Syntax”.
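
For example, a minimal sketch of the whole change might look like this (assuming a single default scheduler, that /etc/pbs.conf defines PBS_HOME, and using ncpus as the resource to balance on):

source /etc/pbs.conf
vi $PBS_HOME/sched_priv/sched_config   # set: node_sort_key: "ncpus LOW assigned" ALL
kill -HUP $(pgrep pbs_sched)           # scheduler re-reads its config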

Vasek


Hi @vchlum,

That’s exactly what I wanted!
Thank you for your kind support :smiley:

Regards,

I think adding a place line after your select line in the job script, i.e.
#PBS -l place=scatter
would also do this if you want to control specific jobs rather than the global configuration.

Hi @source,

Thank you for your reply.

When I tried, e.g., running
$ qsub -lselect=ncpus=1 -lplace=scatter test.sh
three times, those jobs were still placed onto a single machine, which is not what I wanted.

Regards,

Yes, you are right, I made a mistake. place=scatter only affects chunks within a single job. :frowning:

Use place=scatter:excl

Hi @pcebull,

Thank you for your comment.
":excl" prevents multiple jobs from being assigned to a single node (even if the node has enough resources), so it does not work when I submit more jobs than there are nodes.

Regards,

Did you get this working? I am moving from Rocks Cluster to OpenHPC w/ PBS Pro, was working on queues, and have the same question. Currently everything gets piled up on C1 before moving to C2, 3, 4… I would also like to have jobs distributed across nodes evenly…

Thanks

Please use the below PBS directive for multi-node jobs:
#PBS -l place=scatter

example:
qsub -l select=4:ncpus=4 -l place=scatter -- /bin/application arguments

cat pbs.sh

#!/bin/bash
#PBS -l select=4:ncpus=4
#PBS -l place=scatter
/bin/application argument

“qsub: Cannot be used with select or place: nodes”

I think that is a different setting from what I am really looking for. I don't want to set it within the script anyway; it should be set globally to control how jobs are scheduled across the nodes.

In Maui.cfg, it was set here…

NODEALLOCATIONPOLICY PRIORITY
NODECFG[DEFAULT] PRIORITYF=-1.0*JOBCOUNT

This setting makes Maui assign jobs to the nodes that have the lowest load and the fewest jobs.

Could you please share the PBS directives used in your script or qsub submission?
Also, could you please share the version of PBS Pro you are using?

Please check the scheduler config file (source /etc/pbs.conf, then look at $PBS_HOME/sched_priv/sched_config) and check for node_sort_key.

Please check this document https://www.altair.com/pdfs/pbsworks/PBS19.2.3_BigBook.pdf, specifically the sections below:
4.9.50.1 node_sort_key Syntax
4.9.50.2.i Examples of Sorting Vnodes

After making any updates to the sched_config file, make sure you kill -HUP the pbs_sched process.

Hope this helps.

Does “kill -HUP” do the same thing as “systemctl restart pbs”? Will this kill jobs that are currently running?

Not the same; some configuration changes need systemctl restart pbs.
Please restart when no jobs are running on the system; otherwise, jobs could be killed or requeued.

[updated]
Please check these sections of this guide: https://www.altair.com/pdfs/pbsworks/PBS19.2.3_BigBook.pdf
Chapter 7 Starting & Stopping PBS
Table 7-2: Commands to Start, Stop, Restart, Status PBS
Table 7-3: MoM Restart Options
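
As a rough sketch (qstat -r lists running jobs; a full restart is only safe once nothing is running or everything can be requeued):

qstat -r                  # check for running jobs first
systemctl restart pbs     # full restart; for scheduler-only config changes, kill -HUP the pbs_sched process instead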

Thanks for your help, one last question. Can you restart the scheduler without impacting running jobs?

7.2.7.3.i Reinitializing the Scheduler on Linux
ps -ef | grep pbs_sched
kill -HUP <scheduler PID>

PBS Scheduler is stateless. You can kill the scheduler and start it at any point.

You can do this:

  1. kill -HUP <pbs_sched PID>
  2. qmgr -c "set server scheduling = true" # to start a new scheduling cycle

Thank you

Still trying to get this to work. I looked at "node_sort_key", but I think that is more along the lines of what you would use to sort nodes if you had a bunch of nodes with varying configurations (ncpus, mem, etc.).
I looked in the PBS scheduler config and saw "smp_cluster_dist", which seems to be exactly what I am trying to do, but it didn't change the outcome.

Is there a required time frame between job submissions? The reason I ask is that the script I am testing with actually runs through 20 or so jobs: it creates the scripts, submits them, and then loops back to run the next one until all are done. So when I submit, it creates 20 jobs in a second or so.

I have made the changes and still, when I submit jobs, they are scheduled more like the "pack" method, where one node gets filled up to capacity before jobs spill over to the next node. I changed smp_cluster_dist to "lowest_load" thinking that would help, but still no good.

cat /opt/pbs/etc/pbs_sched_config | grep -v '#' | grep -v -e '^$'

round_robin: False all
by_queue: True prime
by_queue: True non_prime
strict_ordering: false ALL
help_starving_jobs: true ALL
max_starve: 24:00:00
backfill_prime: false ALL
prime_exempt_anytime_queues: false
primetime_prefix: p_
nonprimetime_prefix: np_
node_sort_key: "ncpus LOW" ALL
provision_policy: "aggressive_provision"
sort_queues: true ALL
resources: "ncpus, mem, arch, host, vnode, aoe, eoe"
load_balancing: true ALL
smp_cluster_dist: lowest_load
fair_share: true ALL
unknown_shares: 10
fairshare_usage_res: cput
fairshare_entity: euser
fairshare_decay_time: 24:00:00
fairshare_decay_factor: 0.5
preemptive_sched: true ALL
preempt_queue_prio: 150
preempt_prio: "express_queue, normal_jobs"
preempt_order: "SCR"
preempt_sort: min_time_since_start
dedicated_prefix: ded
log_filter: 3328

Please share your pbsnodes -aSj output and your chunk submission line. What job placement string do you use?

No, there is no required time frame; thousands of jobs are submitted within a minute in some use cases.

smp_cluster_dist is deprecated. Please check the PBS Pro administrator guide.

Please check:
4.7 Specifying Job Placement in the PBS Professional User Guide https://www.altair.com/pdfs/pbsworks/PBSUserGuide19.2.3.pdf

and use the below node_sort_key:
node_sort_key: "ncpus LOW assigned" ALL
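
Once the scheduler has re-read the config, you can sanity-check the behaviour with a few throwaway single-CPU jobs, e.g. (a sketch; /bin/sleep is just a stand-in workload):

for i in 1 2 3; do qsub -l select=1:ncpus=1 -- /bin/sleep 60; done
pbsnodes -aSj    # the jobs should now land on different nodes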

Still "packs" all jobs on a node, then moves to the next.

Maybe I have something wrong with my queue settings? I haven't done much on that part yet.

Queue: default
queue_type = Execution
Priority = 50
total_jobs = 15
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:15 Exiting:0 Begun:0
resources_max.mem = 24000mb
resources_max.ncpus = 16
resources_max.nodes = 4
resources_max.walltime = 96:00:00
resources_default.ncpus = 1
resources_default.walltime = 24:00:00
resources_assigned.mem = 360000mb
resources_assigned.mpiprocs = 120
resources_assigned.ncpus = 120
resources_assigned.nodect = 15
hasnodes = True
enabled = True
started = True

[~]# pbsnodes -a
compute-00
Mom = compute-00.local.cluster
Port = 15002
pbs_version = 19.1.1
ntype = PBS
state = free
pcpus = 72
jobs = 22669.hmrihpcp02/0, 22669.hmrihpcp02/1, 22669.hmrihpcp02/2, 22669.hmrihpcp02/3
resources_available.arch = linux
resources_available.host = compute-00
resources_available.mem = 394618508kb
resources_available.ncpus = 72
resources_available.vnode = compute-00
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 56
resources_assigned.vmem = 0kb
queue = default
resv_enable = True
sharing = default_shared
last_state_change_time = Mon May 18 10:29:21 2020
last_used_time = Mon May 18 10:29:21 2020

Thank you for sharing the details.

  1. Your queue settings are correct.

  2. If you have only one compute node in the PBS cluster, then all the jobs have to be packed onto that node.

  3. If you have more than one compute node, then you will see jobs being scheduled onto other nodes, based on the ncpus allocation, once you set node_sort_key: "ncpus LOW assigned" ALL and kill -HUP the pbs_sched process.

Please share the output of pbsnodes -aSj and qmgr -c 'p s'