Hello,
we are experiencing some problems with job scheduling in OpenPBS 19.1.3 on CentOS 7. In particular, from time to time, jobs are kept queued even when there are enough free resources.
We have 3 compute nodes, daneel01 through daneel03, each equipped with 72 CPUs, 4 GPUs, and 1.5 TB of RAM. These nodes are associated with two queues, q07daneel and q14daneel, defined as follows:
create queue q07daneel
set queue q07daneel queue_type = Execution
set queue q07daneel max_queued = [u:PBS_GENERIC=10]
set queue q07daneel acl_host_enable = False
set queue q07daneel acl_user_enable = False
set queue q07daneel resources_max.walltime = 168:00:00
set queue q07daneel resources_min.walltime = 00:00:00
set queue q07daneel acl_group_enable = True
set queue q07daneel acl_groups = +
set queue q07daneel default_chunk.Qlist = daneel
set queue q07daneel max_run_res.nodect = [u:PBS_GENERIC=2]
set queue q07daneel enabled = True
set queue q07daneel started = True
create queue q14daneel
set queue q14daneel queue_type = Execution
set queue q14daneel max_queued = [u:PBS_GENERIC=10]
set queue q14daneel resources_max.walltime = 336:00:00
set queue q14daneel resources_min.walltime = 168:00:00
set queue q14daneel acl_group_enable = True
set queue q14daneel acl_groups = +
set queue q14daneel default_chunk.Qlist = daneel
set queue q14daneel max_run_res.nodect = [u:PBS_GENERIC=2]
set queue q14daneel enabled = True
set queue q14daneel started = True
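Both queues push jobs onto the daneel nodes through default_chunk.Qlist = daneel, so the scheduler can only place a job where the node advertises a matching Qlist value. A minimal way to verify this mapping, assuming Qlist is a custom host-level string_array resource (the usual setup for this pattern):

```shell
# Assumption: Qlist is a custom host-level string_array resource used to
# map queues to nodes. Check its definition (type and flag):
qmgr -c "list resource Qlist"

# Check that every daneel node actually advertises the value "daneel":
pbsnodes -a | grep -E '^daneel|resources_available\.Qlist'

# If a node were missing the value, it could be added with:
# qmgr -c "set node daneel01 resources_available.Qlist += daneel"
```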
We also enabled the cgroups hook.
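Since the cgroups hook can change how memory is assigned and reported on the nodes, its configuration may be relevant here. A sketch of how to inspect it, assuming the stock hook name pbs_cgroups:

```shell
# Assumption: the stock OpenPBS cgroups hook, named pbs_cgroups.
qmgr -c "list hook pbs_cgroups"

# Export its configuration to inspect the memory settings
# (e.g. reserved memory, excluded hosts, vnode_per_numa_node):
qmgr -c "export hook pbs_cgroups application/x-config default" > pbs_cgroups.json
```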
As an example of the problems, consider the following set of running jobs:
DANEEL01
Job ID: 35629
Queue: q14daneel
Resources: select=1; ncpus=1; ngpus=0; mem=2048mb
Summary of the assigned resources:
ncpus: 1/72; gpus: 0/4; RAM: 2gb / 1.5tb
DANEEL02
Job ID: 35391
Queue: q14daneel
Resources: select=1; ncpus=12; mpiprocs=12; mem=32gb
Job ID: 35409
Queue: q14daneel
Resources: select=1; ncpus=4; mpiprocs=4; mem=32gb
Job ID: 36267
Queue: q14daneel
Resources: select=1; ncpus=2; mem=32768mb
Job ID: 36275
Queue: q07daneel
Resources: select=1; ncpus=1; mem=512mb
Job ID: 36355
Queue: q07daneel
Resources: select=1; ncpus=32; ngpus=4; mem=800gb
Job ID: 36433
Queue: q14daneel
Resources: select=1; ncpus=4; mem=32768mb
Summary of the assigned resources:
ncpus: 55/72; gpus: 4/4; RAM: 928.5gb / 1.5tb
DANEEL03
Job ID: 36284
Queue: q07daneel
Resources: select=1; ncpus=4; ngpus=1 (defaults: mem=2048mb)
Job ID: 36428
Queue: q07daneel
Resources: select=1; ncpus=2; ngpus=1 (defaults: mem=2048mb)
Summary of the assigned resources:
ncpus: 6/72; gpus: 2/4; RAM: 4gb / 1.5tb
Given this situation, PBS kept the following jobs in the queue:
Job ID: 35732
Queue: q07daneel
Resources: select=1; ncpus=72; ngpus=4; mem=1450gb
PBS comment: Insufficient amount of resource: Qlist
Job ID: 36103
Queue: q07daneel
Resources: select=1; ncpus=70; mem=64gb
PBS comment: Insufficient amount of resource: Qlist
Job ID: 36357
Queue: q07daneel
Resources: select=1; ncpus=40; ngpus=1; mem=700gb
PBS comment: Queue q07daneel per-user limit reached on resource nodect
Job ID: 36416
Queue: q14daneel
Resources: select=2; ncpus=60; mpiprocs=60; mem=150GB
PBS comment: Insufficient amount of resource: Qlist
Job 35732 is rightly queued, since no node had 72 free CPUs.
Job 36103, on the other hand, could have run on daneel01, but PBS kept it queued.
Similarly, 36416 could have run across daneel01 and daneel03.
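The free-capacity arithmetic behind these claims can be double-checked mechanically. The sketch below hardcodes the per-node summaries from above (in MB: 1.5 TB = 1572864 MB, 928.5 GB = 950784 MB) and tests whether job 36103's request (70 CPUs, 64 GB = 65536 MB) fits on each node; it only checks raw capacity, not scheduler policy:

```shell
# fits <used_cpus> <used_mem_mb> <req_cpus> <req_mem_mb>
# Node capacity: 72 CPUs, 1572864 MB of RAM (from the summaries above).
fits() {
  if [ $((72 - $1)) -ge "$3" ] && [ $((1572864 - $2)) -ge "$4" ]; then
    echo yes
  else
    echo no
  fi
}

fits 1  2048   70 65536   # daneel01: 71 CPUs free -> yes
fits 55 950784 70 65536   # daneel02: 17 CPUs free -> no
fits 6  4096   70 65536   # daneel03: 66 CPUs free -> no
```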
As for 36357, PBS complained that the user had reached the max_run_res.nodect limit. However, at that moment the user had only one job running, while the limit is set to 2.
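For this kind of mismatch it usually helps to look at the scheduler's own reasoning. A sketch of the standard diagnosis commands (the username is a placeholder):

```shell
# Scheduler and server log entries for job 36357 over the last 3 days:
tracejob -n 3 36357

# Full job status, including the comment:
qstat -f 36357

# Confirm the limit as currently set, and the user's running jobs:
qmgr -c "list queue q07daneel"
qstat -u <username>    # <username> is a placeholder
```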
Finally, when inspecting daneel03's resources with "pbsnodes daneel03", PBS reported that no memory was assigned:
resources_assigned.mem = 0kb
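For this last oddity, it may help to compare what the node reports against what the scheduler is told to account for. A hedged check, assuming the default PBS_HOME of /var/spool/pbs on CentOS:

```shell
# Assumption: PBS_HOME is /var/spool/pbs (the CentOS default).
# The scheduler only schedules on resources listed here; mem and the
# custom Qlist resource should both appear on this line:
grep '^resources:' /var/spool/pbs/sched_priv/sched_config

# Compare the node's advertised vs. assigned resources:
pbsnodes daneel03 | grep -E 'resources_(available|assigned)\.(mem|ncpus|ngpus)'
```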
Do you have any idea about the possible causes of this strange behavior?
Thank you in advance for your help!