"Not Running: No available resources on nodes" even when every core is 'free' on cluster

After restarting PBS Pro (2022.1), it appears that the server, scheduler, and comm daemons are down. When I submit a job array, no resources are allocated to any subjob, even though no jobs are running.

All nodes are free:
pbsnodes -aSj
                                                        mem       ncpus   nmics   ngpus
vnode           state           njobs   run   susp      f/t        f/t     f/t     f/t   jobs
--------------- --------------- ------ ----- ------ ------------ ------- ------- ------- -----
kvn01           free                 0     0      0  252gb/252gb   64/64     0/0     0/0 --
kvn02           free                 0     0      0  252gb/252gb   64/64     0/0     0/0 --
kvn03           free                 0     0      0  252gb/252gb   64/64     0/0     0/0 --
kvn04           free                 0     0      0  252gb/252gb   64/64     0/0     0/0 --
kvn05           free                 0     0      0  252gb/252gb   64/64     0/0     0/0 --
kvn06           free                 0     0      0  252gb/252gb   64/64     0/0     0/0 --
kvn07           free                 0     0      0  252gb/252gb   64/64     0/0     0/0 --
kvn08           free                 0     0      0  252gb/252gb   64/64     0/0     0/0 --

But when I submit a job array, I get “comment = Not Running: No available resources on nodes” and the job sits in the Q state indefinitely.

qstat -f
Job Id: 17158.kvh.cm.cluster
Job_Name = test.sh
Job_Owner = scurti15@kvh.cm.cluster
job_state = Q
queue = workq
server = kvh.cm.cluster
Checkpoint = u
ctime = Mon May 20 13:42:26 2024
Error_Path = kvh.cm.cluster:/data/home/scurti15/scripts/pbs/
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Mon May 20 13:42:26 2024
Output_Path = kvh.cm.cluster:/data/home/scurti15/scripts/pbs/
Priority = 0
qtime = Mon May 20 13:42:26 2024
Rerunable = True
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.place = pack
Resource_List.select = 1:ncpus=1
substate = 10
Variable_List = PBS_O_HOME=/data/home/scurti15,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=scurti15,
PBS_O_PATH=/data/home/scurti15/anaconda3/bin:/data/home/scurti15/anaco
nda3/condabin:/cm/local/apps/gcc/11.2.0/bin:/data/home/scurti15/.local/
bin:/data/home/scurti15/bin:/cm/local/apps/environment-modules/4.5.3/bi
n:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/sbin:/cm
/local/apps/environment-modules/4.5.3/bin:/opt/pbs/bin:/data/home/scurt
i15/anaconda3/lib:/data/home/scurti15/anaconda3/bin:/data/home/scurti15
/tools:/data/home/scurti15/tools/samtools:/data/home/scurti15/tools/bed
tools:/data/home/scurti15/tools/fastqc:/data/home/scurti15/tools/gatk:/
data/home/scurti15/tools/samtools-1.17:/data/home/scurti15/tools/htslib
-1.17:/data/home/scurti15/tools/ucsc:/usr/include,
PBS_O_MAIL=/var/spool/mail/scurti15,PBS_O_SHELL=/bin/bash,
PBS_O_HOST=kvh.cm.cluster,
PBS_O_WORKDIR=/data/home/scurti15/scripts/fragmentomics,
PBS_O_SYSTEM=Linux,PBS_O_QUEUE=workq
comment = Not Running: No available resources on nodes
etime = Mon May 20 13:42:26 2024
Submit_arguments = gc_content_wgs.sh
array = True
array_state_count = Queued:90 Running:0 Exiting:0 Expired:0
array_indices_submitted = 1-90
array_indices_remaining = 1-90
project = _pbs_project_default
Submit_Host = kvh.cm.cluster
max_run_subjobs = 45

It seems like the server, scheduler, and comm are down:

/etc/init.d/pbs status
pbs_server is not running
pbs_sched is not running
pbs_comm is not running
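The init script may be reporting from stale PID files rather than the live process table, so I also cross-checked with pgrep. This is a generic probe, nothing PBS-specific beyond the standard daemon names:

```shell
# Ask the process table directly which PBS daemons are alive,
# independent of what /etc/init.d/pbs thinks.
status=""
for d in pbs_server pbs_sched pbs_comm; do
    if pgrep -x "$d" >/dev/null 2>&1; then
        status="$status $d:up"
    else
        status="$status $d:down"
    fi
done
echo "$status"
```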

The server logs show the following:

05/20/2024 16:18:10;0006;Server@kvh;Svr;Server@kvh;PBSProNodes feature not found, checking for PBSProSockets
05/20/2024 16:18:10;0006;Server@kvh;Svr;Server@kvh;NetlibToCCheckout error. Error Code: 9
Error String: Feature: PBSProSockets
Error Code: 9
Error Description:
[NETWORK] 6200@kvh.cm.cluster - (Err: 9) Feature not found
License Path: 6200@kvh.cm.cluster
: (null)
05/20/2024 16:18:11;0100;Server@kvh;Req;;Type 0 request received from scurti15@kvh.cm.cluster, sock=10
05/20/2024 16:18:11;0100;Server@kvh;Req;;Type 95 request received from scurti15@kvh.cm.cluster, sock=22
05/20/2024 16:18:11;0100;Server@kvh;Req;;Type 21 request received from scurti15@kvh.cm.cluster, sock=10
05/20/2024 16:18:11;0100;Server@kvh;Req;;Type 21 request processed from scurti15@kvh.cm.cluster, sock=10
05/20/2024 16:18:11;0100;Server@kvh;Req;;Type 19 request received from scurti15@kvh.cm.cluster, sock=10
05/20/2024 16:18:11;0100;Server@kvh;Req;;Type 19 request processed from scurti15@kvh.cm.cluster, sock=10
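The first two lines are the interesting ones. A minimal sketch of filtering a server log for license failures; the sample text here is the excerpt above, and in practice you would point the same grep at the real log file (e.g. $PBS_HOME/server_logs/20240520 on a default install — path is an assumption):

```shell
# Count license-related failures in a PBS server log excerpt.
log='05/20/2024 16:18:10;0006;Server@kvh;Svr;Server@kvh;PBSProNodes feature not found, checking for PBSProSockets
05/20/2024 16:18:10;0006;Server@kvh;Svr;Server@kvh;NetlibToCCheckout error. Error Code: 9'
hits=$(printf '%s\n' "$log" | grep -c -E 'feature not found|NetlibToCCheckout')
echo "license-related log lines: $hits"
```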

SELinux is disabled:

SELinux status: disabled

The server is resolvable:

pbs_hostn -v kvh
primary name: kvh.cm.cluster (from gethostbyname())
aliases: kvh
aliases: master.cm.cluster
aliases: master
aliases: localmaster.cm.cluster
aliases: localmaster
aliases: ldapserver.cm.cluster
aliases: ldapserver
address length: 4 bytes
address: 10.141.255.254 (4278160650 dec) name: kvh.cm.cluster

When I try to get the status of PBS using systemctl status pbs, it prints nothing.
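If systemctl status pbs prints nothing, there may simply be no unit registered under that exact name (unit names vary with how PBS Pro was installed). A generic probe for any PBS-related unit:

```shell
# List any systemd units mentioning "pbs"; fall back gracefully if
# systemctl is unavailable or no such unit exists.
units=$(systemctl list-unit-files --no-pager 2>/dev/null | grep -i pbs || true)
if [ -z "$units" ]; then
    echo "no pbs systemd unit registered"
else
    printf '%s\n' "$units"
fi
```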

I am able to ping the host:

ping kvh
PING kvh.cm.cluster (10.141.255.254) 56(84) bytes of data.
64 bytes from kvh.cm.cluster (10.141.255.254): icmp_seq=1 ttl=64 time=0.050 ms
64 bytes from kvh.cm.cluster (10.141.255.254): icmp_seq=2 ttl=64 time=0.047 ms
64 bytes from kvh.cm.cluster (10.141.255.254): icmp_seq=3 ttl=64 time=0.047 ms
64 bytes from kvh.cm.cluster (10.141.255.254): icmp_seq=4 ttl=64 time=0.021 ms
^C
--- kvh.cm.cluster ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3060ms
rtt min/avg/max/mdev = 0.021/0.041/0.050/0.012 ms

nslookup kvh
Server: 127.0.0.1
Address: 127.0.0.1#53
Name: kvh.cm.cluster
Address: 10.141.255.254

Could use any help to get the nodes up and running. Thank you!

Please share the output of the below commands as root user from the PBS Server host:
qstat -Bf
pbsnodes -av

[root@kvh ~]# pbsnodes -av
kvn01
Mom = kvn01.cm.cluster
Port = 15002
pbs_version = 2022.1.2.20221214134647
ntype = PBS
state = free
pcpus = 64
resources_available.arch = linux
resources_available.host = kvn01
resources_available.mem = 263723216kb
resources_available.ncpus = 64
resources_available.vnode = kvn01
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
last_state_change_time = Mon May 20 13:23:07 2024
last_used_time = Sun May 19 16:29:36 2024
server_instance_id = kvh.cm.cluster:15001

[root@kvh ~]# qstat -Bf
Server: kvh.cm.cluster
server_state = Active
server_host = kvh.cm.cluster
scheduling = True
total_jobs = 45003
state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0 Begun:0
default_queue = workq
log_events = 511
mailer = /usr/sbin/sendmail
mail_from = adm
query_other_jobs = True
resources_default.ncpus = 1
default_chunk.ncpus = 1
scheduler_iteration = 600
resv_enable = True
node_fail_requeue = 310
max_array_size = 10000
pbs_license_info = 6200@kvh.cm.cluster
pbs_license_min = 0
pbs_license_max = 2147483647
pbs_license_linger_time = 31536000
license_count = Avail_Global:0 Avail_Local:0 Used:0 High_Use:0
pbs_version = 2022.1.2.20221214134647
eligible_time_enable = False
job_history_enable = True
max_concurrent_provision = 5
power_provisioning = False
max_job_sequence_id = 9999999

It seems the server can't check out a license from the LM-X license server at 6200@kvh.cm.cluster: license_count shows Avail_Global:0, so the scheduler has nothing licensed to run jobs on even though every node is free.
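A quick scriptable sanity check: extract Avail_Global from the qstat -Bf output (the line below is pasted literally from the output above):

```shell
# Parse the available-license count out of a qstat -Bf license_count line.
line='license_count = Avail_Global:0 Avail_Local:0 Used:0 High_Use:0'
avail=${line#*Avail_Global:}   # drop everything through "Avail_Global:"
avail=${avail%% *}             # keep the digits up to the next space
echo "Avail_Global=$avail"
[ "$avail" -eq 0 ] && echo "no licenses available -> jobs stay queued"
```

If it is zero, the next step is on the license side: confirm the LM-X daemon is actually listening on 6200@kvh.cm.cluster and serving a valid PBS Pro feature (LM-X ships an end-user utility, lmxendutil -licstat -host kvh -port 6200, if it is installed on your system), then restart pbs_server so it re-attempts the checkout.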