How to scatter jobs over vnodes?

Hi folks,

How can I scatter (i.e. round-robin) multiple jobs over the vnodes?

With our current configuration, when I submit 5 jobs that each use only a small amount of a shared resource, they are all assigned to the same vnode:

  • vnode1: job1, job2, job3, job4, job5
  • vnode2: (vacant)
  • vnode3: (vacant)

I want them to be assigned to different vnodes to avoid the slowdown caused by resource contention:

  • vnode1: job1, job4
  • vnode2: job2, job5
  • vnode3: job3

I know I can scatter multiple chunks within a single job, but I could not find a way to scatter multiple independent jobs.
Any comments or suggestions would be greatly appreciated.
Thank you,

Hi @Ikki,

I think node_sort_key can help you. This option can be set in $PBS_HOME/sched_priv/sched_config:

node_sort_key: "<resource> LOW assigned" ALL

Do not forget to kill -HUP the scheduler after saving the file. More info on node_sort_key can be found in the admin guide: “4.8.50.1 node_sort_key Syntax”.
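
For example, a minimal sketch of the whole change might look like this (assuming a single default scheduler, that /etc/pbs.conf defines PBS_HOME, and using ncpus as the resource to balance on):

source /etc/pbs.conf
vi $PBS_HOME/sched_priv/sched_config   # set: node_sort_key: "ncpus LOW assigned" ALL
kill -HUP $(pgrep pbs_sched)           # scheduler re-reads its config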

Vasek


Hi @vchlum,

That’s exactly what I wanted!
Thank you for your kind support :smiley:

Regards,

I think adding a place line after your select line in the job script, i.e.
#PBS -l place=scatter
would also do this if you want to control specific jobs rather than the global configuration.

Hi @source,

Thank you for your reply.

When I tried, e.g., running
$ qsub -lselect=ncpus=1 -lplace=scatter test.sh
three times, those jobs were still placed onto a single machine, which is not what I wanted.

Regards,

Yes, you are right, I made a mistake. place=scatter only affects chunks within a single job. :frowning:

Use place=scatter:excl

Hi @pcebull,

Thank you for your comment.
":excl" prevents multiple jobs from being assigned to a single node (even if the node has enough resources), so it does not work when I submit more jobs than there are nodes.

Regards,

Did you get this working? I am moving from Rocks Cluster to OpenHPC w/ PBS Pro, was working on queues, and have the same question. Currently everything gets piled up on C1 before moving to C2, 3, 4… I would also like to have jobs distributed across nodes evenly…

Thanks

Please use the below PBS directive for multi-node jobs:
#PBS -l place=scatter

example:
qsub -l select=4:ncpus=4 -l place=scatter -- /bin/application arguments

cat pbs.sh

#!/bin/bash
#PBS -l select=4:ncpus=4
#PBS -l place=scatter
/bin/application argument

“qsub: Cannot be used with select or place: nodes”

I think that is a different setting from what I am really looking for. I don't want to set it within the script anyway; it should be set globally to control how jobs are scheduled across the nodes.

In Maui.cfg, it was set here…

NODEALLOCATIONPOLICY PRIORITY
NODECFG[DEFAULT] PRIORITYF=-1.0*JOBCOUNT

This setting makes Maui assign jobs to the nodes that have the lowest load and the fewest jobs.

Could you please share the PBS directives used in your script or qsub submission?
Also, could you please share the version of PBS Pro you are using?

Please check the scheduler config file (source /etc/pbs.conf, then look at $PBS_HOME/sched_priv/sched_config) and check for node_sort_key.

Please check this document https://www.altair.com/pdfs/pbsworks/PBS19.2.3_BigBook.pdf, specifically the sections below:
4.9.50.1 node_sort_key Syntax
4.9.50.2.i Examples of Sorting Vnodes

After making any updates to the sched_config file, make sure you kill -HUP the pbs_sched process.

Hope this helps.

Does “kill -HUP” do the same thing as “systemctl restart pbs”? Will this kill jobs that are currently running?

Not the same; some configuration changes need systemctl restart pbs.
Please restart when no jobs are running on the system; otherwise, jobs could be killed or requeued.

[updated]
Please check these sections of this guide: https://www.altair.com/pdfs/pbsworks/PBS19.2.3_BigBook.pdf
Chapter 7 Starting & Stopping PBS
Table 7-2: Commands to Start, Stop, Restart, Status PBS
Table 7-3: MoM Restart Options
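
As a rough sketch (qstat -r lists running jobs; a full restart is only safe once nothing is running or everything can be requeued):

qstat -r                  # check for running jobs first
systemctl restart pbs     # full restart; for scheduler-only config changes, kill -HUP the pbs_sched process instead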

Thanks for your help, one last question. Can you restart the scheduler without impacting running jobs?

7.2.7.3.i Reinitializing the Scheduler on Linux
ps -ef | grep pbs_sched
kill -HUP <scheduler PID>

PBS Scheduler is stateless. You can kill the scheduler and start it at any point.

You can do this:

  1. kill -HUP <pbs_sched PID>
  2. qmgr -c "set server scheduling = true" # to start a new scheduling cycle

Thank you

Still trying to get this to work. I looked at "node_sort_key", but I think that is more along the lines of what you would use to sort nodes if you had a bunch of nodes with varying configurations (ncpus, mem, etc.).
I looked in the PBS scheduler config and saw "smp_cluster_dist", which seems to be exactly what I am trying to do, but it didn't change the outcome.

Is there a required time frame between job submissions? The reason I ask is that the script I am testing with actually runs through 20 or so jobs: it creates the scripts, submits them, and then loops back to run the next one until all are done. So when I submit, it creates 20 jobs in a second or so.

I have made the changes and still, when I submit jobs, they are scheduled more like the "pack" method, where one node gets filled up to capacity before jobs spill over to the next node. I changed smp_cluster_dist to "lowest_load" thinking that would help, but still no good.

cat /opt/pbs/etc/pbs_sched_config | grep -v '#' | grep -v -e '^$'

round_robin: False all
by_queue: True prime
by_queue: True non_prime
strict_ordering: false ALL
help_starving_jobs: true ALL
max_starve: 24:00:00
backfill_prime: false ALL
prime_exempt_anytime_queues: false
primetime_prefix: p_
nonprimetime_prefix: np_
node_sort_key: "ncpus LOW" ALL
provision_policy: "aggressive_provision"
sort_queues: true ALL
resources: "ncpus, mem, arch, host, vnode, aoe, eoe"
load_balancing: true ALL
smp_cluster_dist: lowest_load
fair_share: true ALL
unknown_shares: 10
fairshare_usage_res: cput
fairshare_entity: euser
fairshare_decay_time: 24:00:00
fairshare_decay_factor: 0.5
preemptive_sched: true ALL
preempt_queue_prio: 150
preempt_prio: "express_queue, normal_jobs"
preempt_order: "SCR"
preempt_sort: min_time_since_start
dedicated_prefix: ded
log_filter: 3328

Please share your pbsnodes -aSj output and your chunk submission line. What job placement string do you use?

No, there is no required time frame; thousands of jobs are submitted within a minute in some use cases.

smp_cluster_dist is deprecated. Please check the PBS Pro administrator guide.

Please check:
4.7 Specifying Job Placement in the PBS Professional User Guide https://www.altair.com/pdfs/pbsworks/PBSUserGuide19.2.3.pdf

and use the below node_sort_key:
node_sort_key: "ncpus LOW assigned" ALL
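
Once the scheduler has re-read the config, you can sanity-check the behaviour with a few throwaway single-CPU jobs, e.g. (a sketch; /bin/sleep is just a stand-in workload):

for i in 1 2 3; do qsub -l select=1:ncpus=1 -- /bin/sleep 60; done
pbsnodes -aSj    # the jobs should now land on different nodes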

Still "packs" all jobs on a node, then moves to the next.

Maybe I have something wrong with my queue settings? I haven't done much on that part yet.

Queue: default
queue_type = Execution
Priority = 50
total_jobs = 15
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:15 Exiting:0 Begun:0
resources_max.mem = 24000mb
resources_max.ncpus = 16
resources_max.nodes = 4
resources_max.walltime = 96:00:00
resources_default.ncpus = 1
resources_default.walltime = 24:00:00
resources_assigned.mem = 360000mb
resources_assigned.mpiprocs = 120
resources_assigned.ncpus = 120
resources_assigned.nodect = 15
hasnodes = True
enabled = True
started = True

[~]# pbsnodes -a
compute-00
Mom = compute-00.local.cluster
Port = 15002
pbs_version = 19.1.1
ntype = PBS
state = free
pcpus = 72
jobs = 22669.hmrihpcp02/0, 22669.hmrihpcp02/1, 22669.hmrihpcp02/2, 22669.hmrihpcp02/3
resources_available.arch = linux
resources_available.host = compute-00
resources_available.mem = 394618508kb
resources_available.ncpus = 72
resources_available.vnode = compute-00
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 56
resources_assigned.vmem = 0kb
queue = default
resv_enable = True
sharing = default_shared
last_state_change_time = Mon May 18 10:29:21 2020
last_used_time = Mon May 18 10:29:21 2020

Thank you for sharing the details.

  1. Your queue settings are correct.

  2. If you have only one compute node in the PBS cluster, then all the jobs have to be packed onto that node.

  3. If you have more than one compute node, then you will see jobs being scheduled onto other nodes, based on the ncpus allocation, once you set node_sort_key: "ncpus LOW assigned" ALL and kill -HUP the pbs_sched process.

Please share the output of pbsnodes -aSj and qmgr -c 'p s'