Hello,
we are experiencing some problems with job scheduling in OpenPBS 19.1.3 on CentOS 7. In particular, from time to time, jobs are kept queued even when there are enough free resources.
We have 3 compute nodes, daneel01 through daneel03, each equipped with 72 CPUs, 4 GPUs, and 1.5 TB of RAM. These nodes are associated with two queues, q07daneel and q14daneel, defined as follows:
create queue q07daneel
set queue q07daneel queue_type = Execution
set queue q07daneel max_queued = [u:PBS_GENERIC=10]
set queue q07daneel acl_host_enable = False
set queue q07daneel acl_user_enable = False
set queue q07daneel resources_max.walltime = 168:00:00
set queue q07daneel resources_min.walltime = 00:00:00
set queue q07daneel acl_group_enable = True
set queue q07daneel acl_groups = +
set queue q07daneel default_chunk.Qlist = daneel
set queue q07daneel max_run_res.nodect = [u:PBS_GENERIC=2]
set queue q07daneel enabled = True
set queue q07daneel started = True
create queue q14daneel
set queue q14daneel queue_type = Execution
set queue q14daneel max_queued = [u:PBS_GENERIC=10]
set queue q14daneel resources_max.walltime = 336:00:00
set queue q14daneel resources_min.walltime = 168:00:00
set queue q14daneel acl_group_enable = True
set queue q14daneel acl_groups = +
set queue q14daneel default_chunk.Qlist = daneel
set queue q14daneel max_run_res.nodect = [u:PBS_GENERIC=2]
set queue q14daneel enabled = True
set queue q14daneel started = True
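Both queues push jobs onto the daneel nodes through default_chunk.Qlist = daneel, so the scheduler can only place a job where the node advertises a matching Qlist value. A minimal way to verify this mapping, assuming Qlist is a custom host-level string_array resource (the usual setup for this pattern):

```shell
# Assumption: Qlist is a custom host-level string_array resource used to
# map queues to nodes. Check its definition (type and flag):
qmgr -c "list resource Qlist"

# Check that every daneel node actually advertises the value "daneel":
pbsnodes -a | grep -E '^daneel|resources_available\.Qlist'

# If a node were missing the value, it could be added with:
# qmgr -c "set node daneel01 resources_available.Qlist += daneel"
```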
We also enabled the cgroups hook.
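Since the cgroups hook can change how memory is assigned and reported on the nodes, its configuration may be relevant here. A sketch of how to inspect it, assuming the stock hook name pbs_cgroups:

```shell
# Assumption: the stock OpenPBS cgroups hook, named pbs_cgroups.
qmgr -c "list hook pbs_cgroups"

# Export its configuration to inspect the memory settings
# (e.g. reserved memory, excluded hosts, vnode_per_numa_node):
qmgr -c "export hook pbs_cgroups application/x-config default" > pbs_cgroups.json
```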
As an example of the problems, consider the following set of running jobs:
DANEEL01
Job ID: 35629
Queue: q14daneel
Resources: select=1; ncpus=1; ngpus=0; mem=2048mb
Summary of the assigned resources:
ncpus: 1/72; gpus: 0/4; RAM: 2gb / 1.5tb
DANEEL02
Job ID: 35391
Queue: q14daneel
Resources: select=1; ncpus=12; mpiprocs=12; mem=32gb
Job ID: 35409
Queue: q14daneel
Resources: select=1; ncpus=4; mpiprocs=4; mem=32gb
Job ID: 36267
Queue: q14daneel
Resources: select=1; ncpus=2; mem=32768mb
Job ID: 36275
Queue: q07daneel
Resources: select=1; ncpus=1; mem=512mb
Job ID: 36355
Queue: q07daneel
Resources: select=1; ncpus=32; ngpus=4; mem=800gb
Job ID: 36433
Queue: q14daneel
Resources: select=1; ncpus=4; mem=32768mb
Summary of the assigned resources:
ncpus: 55/72; gpus: 4/4; RAM: 928.5gb / 1.5tb
DANEEL03
Job ID: 36284
Queue: q07daneel
Resources: select=1; ncpus=4; ngpus=1 (defaults: mem=2048mb)
Job ID: 36428
Queue: q07daneel
Resources: select=1; ncpus=2; ngpus=1 (defaults: mem=2048mb)
Summary of the assigned resources:
ncpus: 6/72; gpus: 2/4; RAM: 4gb / 1.5tb
Given this situation, PBS kept the following jobs in the queue:
Job ID: 35732
Queue: q07daneel
Resources: select=1; ncpus=72; ngpus=4; mem=1450gb
PBS comment: Insufficient amount of resource: Qlist
Job ID: 36103
Queue: q07daneel
Resources: select=1; ncpus=70; mem=64gb
PBS comment: Insufficient amount of resource: Qlist
Job ID: 36357
Queue: q07daneel
Resources: select=1; ncpus=40; ngpus=1; mem=700gb
PBS comment: Queue q07daneel per-user limit reached on resource nodect
Job ID: 36416
Queue: q14daneel
Resources: select=2; ncpus=60; mpiprocs=60; mem=150GB
PBS comment: Insufficient amount of resource: Qlist
Job 35732 is rightly queued, since no node had 72 free CPUs.
Job 36103, on the other hand, could have run on daneel01, but PBS kept it queued.
Similarly, 36416 could have run across daneel01 and daneel03.
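The free-capacity arithmetic behind these claims can be double-checked mechanically. The sketch below hardcodes the per-node summaries from above (in MB: 1.5 TB = 1572864 MB, 928.5 GB = 950784 MB) and tests whether job 36103's request (70 CPUs, 64 GB = 65536 MB) fits on each node; it only checks raw capacity, not scheduler policy:

```shell
# fits <used_cpus> <used_mem_mb> <req_cpus> <req_mem_mb>
# Node capacity: 72 CPUs, 1572864 MB of RAM (from the summaries above).
fits() {
  if [ $((72 - $1)) -ge "$3" ] && [ $((1572864 - $2)) -ge "$4" ]; then
    echo yes
  else
    echo no
  fi
}

fits 1  2048   70 65536   # daneel01: 71 CPUs free -> yes
fits 55 950784 70 65536   # daneel02: 17 CPUs free -> no
fits 6  4096   70 65536   # daneel03: 66 CPUs free -> no
```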
As for 36357, PBS complained that the user had reached the max_run_res.nodect limit. However, at that moment the user had only one job running, while the limit is set to 2.
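For this kind of mismatch it usually helps to look at the scheduler's own reasoning. A sketch of the standard diagnosis commands (the username is a placeholder):

```shell
# Scheduler and server log entries for job 36357 over the last 3 days:
tracejob -n 3 36357

# Full job status, including the comment:
qstat -f 36357

# Confirm the limit as currently set, and the user's running jobs:
qmgr -c "list queue q07daneel"
qstat -u <username>    # <username> is a placeholder
```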
Finally, when inspecting daneel03's resources with "pbsnodes daneel03", PBS reported that no memory was assigned:
resources_assigned.mem = 0kb
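For this last oddity, it may help to compare what the node reports against what the scheduler is told to account for. A hedged check, assuming the default PBS_HOME of /var/spool/pbs on CentOS:

```shell
# Assumption: PBS_HOME is /var/spool/pbs (the CentOS default).
# The scheduler only schedules on resources listed here; mem and the
# custom Qlist resource should both appear on this line:
grep '^resources:' /var/spool/pbs/sched_priv/sched_config

# Compare the node's advertised vs. assigned resources:
pbsnodes daneel03 | grep -E 'resources_(available|assigned)\.(mem|ncpus|ngpus)'
```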
Do you have any idea about the possible causes of this strange behavior?
Thank you in advance for your help!