PBS Hook for managing a scratch custom host-level resource

Hi,
I wrote a Python PBS hook to manage the scratch disk space on each host (i.e., each compute node). Each host has a given maximum amount of space on the scratch filesystem, and each job can reserve a portion of it. If no host can provide the requested disk space when a job is submitted, that job stays queued.

I created the scratch custom resource as follows:
qmgr -c "create resource scratch type=size, flag=h"
and added it to the sched_config file.
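For completeness, adding it to sched_config simply means appending the resource name to the resources line (the exact list of course depends on what is already there); roughly:

    resources: "ncpus, mem, arch, host, vnode, scratch"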

The scratch custom resource is properly updated on the EXECHOST_PERIODIC event.
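In case it is useful, the periodic part is conceptually like the simplified sketch below (here the free space is read with statvfs from an example /scratch mount point, which is not my exact setup; the important detail is that the host-level resource is published on the natural vnode only):

    import os
    import pbs

    e = pbs.event()
    natural_vnode = pbs.get_local_nodename()   # name of this host's natural vnode

    try:
        st = os.statvfs("/scratch")            # example mount point for the scratch filesystem
        free_kb = (st.f_bavail * st.f_frsize) // 1024
        if natural_vnode in e.vnode_list.keys():
            # scratch is host level, so it is set on the natural vnode only
            e.vnode_list[natural_vnode].resources_available["scratch"] = pbs.size("%dkb" % free_kb)
        e.accept()
    except Exception as err:
        pbs.logmsg(pbs.LOG_DEBUG, "scratch hook: %s" % err)
        e.accept()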

Everything works well with the exception of a very specific test case (in my simulation environment):

Suppose two jobs each reserve some scratch space, such that together they exceed the maximum scratch space available on a given host (while the other requested compute resources could be satisfied by a single host).
Also, suppose that the two jobs are submitted almost simultaneously.

Sometimes the two jobs are wrongly executed on the same host: while scheduling the later-submitted job, the scheduler reads the scratch custom resource, which has not yet been updated by the EXECHOST_PERIODIC event.

For hosts with only one vnode I found the following workaround (which seems fine to me): the scratch resource is also updated on the EXECJOB_BEGIN event. Note that pbs.event().vnode_list contains the current vnode, which is the only vnode available on the host.
If the later-submitted job is wrongly scheduled, it is rejected and reconsidered for execution in the next scheduling cycle.
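The update itself in the EXECJOB_BEGIN hook is essentially the following (new_value_kb is a placeholder for whatever my accounting says is still free on the host):

    import pbs

    e = pbs.event()
    new_value_kb = 0  # placeholder: computed from the host's scratch accounting

    # On a single-vnode host vnode_list contains exactly one entry, the natural
    # vnode, so updating "the current vnode" also updates the host-level resource.
    for vname in e.vnode_list.keys():
        e.vnode_list[vname].resources_available["scratch"] = pbs.size("%dkb" % new_value_kb)
    e.accept()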

For hosts with multiple vnodes, pbs.event().vnode_list contains the current vnode, which does not correspond to the host, while the scratch custom resource is host-level. Thus, I am not able to update that custom resource properly.

Do you have any hints?

Thanks in advance for helping.

  1. You can use pbs.server().scheduler_restart_cycle() in your hook so that, once the scratch space is updated, the scheduling cycle is restarted where applicable.

  2. You can implement a runjob hook that checks the scratch space before the job is dispatched to a compute node (see the sketch below).
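A very rough sketch of the second idea, just to illustrate the shape of such a hook (not taken from the Admin Guide; free_scratch_kb() is a placeholder for wherever you keep the per-host scratch accounting):

    import pbs

    e = pbs.event()
    job = e.job

    requested = job.Resource_List["scratch"]
    if requested is not None:
        # exec_vnode should already be filled in by the scheduler at runjob time,
        # e.g. "(node02[0]:ncpus=1)"; take the host part of the first chunk.
        host = str(job.exec_vnode).lstrip("(").split(":")[0].split("[")[0]
        free = pbs.size("%dkb" % free_scratch_kb(host))   # placeholder lookup
        if free < requested:
            # Send the job back to the scheduler instead of starting it here.
            e.reject("not enough scratch space on %s" % host)
    e.accept()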

Snippet from the PBS Professional Admin Guide 2020.1

Are you using v2 configuration files for your vnodes? Please share the pbsnodes -av output.

Hope the above helps.

Thanks for answering.
First of all, below is the pbsnodes -av output (from my simulation system, which is very similar to my production system):

node02
     Mom = node02.cluster.local
     Port = 15002
     pbs_version = 20.0.1
     ntype = PBS
     state = free
     pcpus = 1
     resources_available.arch = linux
     resources_available.host = node02
     resources_available.mem = 0b
     resources_available.ncpus = 0
     resources_available.scratch = 20960256kb
     resources_available.vnode = node02
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.vmem = 0kb
     resv_enable = True
     sharing = default_shared
     in_multivnode_host = 1
     last_state_change_time = Sun Jan  3 09:58:33 2021
     last_used_time = Mon Dec 28 11:37:01 2020

node01
     Mom = node01.cluster.local
     Port = 15002
     pbs_version = 20.0.1
     ntype = PBS
     state = free
     pcpus = 1
     resources_available.arch = linux
     resources_available.host = node01
     resources_available.mem = 0b
     resources_available.ncpus = 0
     resources_available.scratch = 20960256kb
     resources_available.vnode = node01
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.vmem = 0kb
     resv_enable = True
     sharing = default_shared
     in_multivnode_host = 1
     last_state_change_time = Sun Jan  3 09:49:31 2021
     last_used_time = Sat Jan  2 10:49:20 2021

node01[0]
     Mom = node01.cluster.local
     Port = 15002
     pbs_version = 20.0.1
     ntype = PBS
     state = free
     pcpus = 1
     resources_available.arch = linux
     resources_available.host = node01
     resources_available.mem = 210134kb
     resources_available.ncpus = 1
     resources_available.vnode = node01[0]
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.vmem = 0kb
     resv_enable = True
     sharing = default_shared
     in_multivnode_host = 1
     last_state_change_time = Sun Jan  3 09:49:31 2021
     last_used_time = Sat Jan  2 15:45:29 2021

node02[0]
     Mom = node02.cluster.local
     Port = 15002
     pbs_version = 20.0.1
     ntype = PBS
     state = free
     pcpus = 1
     resources_available.arch = linux
     resources_available.host = node02
     resources_available.mem = 210134kb
     resources_available.ncpus = 1
     resources_available.vnode = node02[0]
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.vmem = 0kb
     resv_enable = True
     sharing = default_shared
     in_multivnode_host = 1
     last_state_change_time = Sun Jan  3 09:58:33 2021

node02[1]
     Mom = node02.cluster.local
     Port = 15002
     pbs_version = 20.0.1
     ntype = PBS
     state = free
     pcpus = 1
     resources_available.arch = linux
     resources_available.host = node02
     resources_available.mem = 210134kb
     resources_available.ncpus = 1
     resources_available.vnode = node02[1]
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.vmem = 0kb
     resv_enable = True
     sharing = default_shared
     in_multivnode_host = 1
     last_state_change_time = Sun Jan  3 09:58:33 2021

node01[1]
     Mom = node01.cluster.local
     Port = 15002
     pbs_version = 20.0.1
     ntype = PBS
     state = free
     pcpus = 1
     resources_available.arch = linux
     resources_available.host = node01
     resources_available.mem = 210134kb
     resources_available.ncpus = 1
     resources_available.vnode = node01[1]
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.vmem = 0kb
     resv_enable = True
     sharing = default_shared
     in_multivnode_host = 1
     last_state_change_time = Sun Jan  3 09:49:31 2021

Please also note that scratch is a host-level resource and that each host keeps a small local database (SQLite) listing the amount of scratch reserved by each running job. Thus, if the scratch resource has not been updated yet, only that database holds the information on the currently available amount of scratch.
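For reference, the lookup against that database is essentially the following (the path and the table/column names here are only illustrative, not my actual schema):

    import sqlite3

    def reserved_scratch_kb(db_path="/var/spool/pbs/scratch_reservations.db"):
        """Total scratch (in kb) currently reserved by running jobs on this host."""
        conn = sqlite3.connect(db_path)
        try:
            cur = conn.execute("SELECT COALESCE(SUM(scratch_kb), 0) FROM reservations")
            return cur.fetchone()[0]
        finally:
            conn.close()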

Regarding the possibility of restarting the scheduling cycle, I have a question: given the "unless the hook's executing user is a PBS Manager or Operator" restriction, I would need to include scheduler_restart_cycle in the fail_action attribute of my execjob_begin hook in order to make the scheduler restart the cycle when an internal error occurs (e.g., the available scratch space is not enough). Is that right?
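In other words, something along these lines (the hook name here is just a placeholder)?

    qmgr -c "set hook scratch_begin fail_action = scheduler_restart_cycle"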

Following your suggestion about the runjob event, I have found the following workaround: in the case of hosts with multiple vnodes, the scratch custom resource is only updated on the exechost_periodic event.
On the execjob_begin event the hook checks whether the requested scratch can be satisfied on the current host by comparing it with the scratch information in the local database. If it cannot, the job is rejected and a custom flag is set to TRUE in the Variable_List of that job (note that the flag contains the host name too). The job is killed and the scheduler is triggered.
On the subsequent runjob event the flag is checked: the job is rejected again if the exec host (i.e., the host the job is being sent to) corresponds to the one saved in the flag. The job is thus requeued and the scheduler considers it for execution in the next scheduling cycle. In the meantime, the custom scratch resource is properly updated by the exechost_periodic event.
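In code, the two pieces look roughly like this (simplified; PBS_SCRATCH_REJECTED_HOST is just the name I use for the flag, scratch_fits_here() stands in for the check against the local database, and the two functions are only a way to show both hook bodies in one snippet):

    import pbs

    # --- execjob_begin hook (runs on the MoM) ---
    def begin_hook():
        e = pbs.event()
        job = e.job
        if not scratch_fits_here(job):   # placeholder: compare the request against the sqlite DB
            # Remember which host rejected the job, then reject it so it is requeued.
            job.Variable_List["PBS_SCRATCH_REJECTED_HOST"] = pbs.get_local_nodename()
            e.reject("not enough scratch space on this host")
        e.accept()

    # --- runjob hook (runs on the server) ---
    def runjob_hook():
        e = pbs.event()
        job = e.job
        try:
            rejected_host = job.Variable_List["PBS_SCRATCH_REJECTED_HOST"]
        except Exception:
            rejected_host = None
        target_host = str(job.exec_vnode).lstrip("(").split(":")[0].split("[")[0]
        if rejected_host is not None and rejected_host == target_host:
            # Same host as before: its periodic scratch update has not landed yet,
            # so keep the job queued until the next scheduling cycle.
            e.reject("scratch accounting on %s not refreshed yet" % target_host)
        e.accept()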

That seems to work for me.

Thanks for helping.
