Jobs stuck in Q state and the scheduler does not seem to be working

Hi all,

I just installed OpenPBS on a single server, following the INSTALL instructions on GitHub. Everything looks fine except that all jobs are stuck in the Q state. I searched around but have no idea how to fix it. My guess is that the scheduler is not working correctly. Below is some output that may help in finding a clue. Thank you for taking the time to help.

$ /etc/init.d/pbs status
pbs_server is pid 10064
pbs_mom is pid 9871
pbs_sched is pid 9884
pbs_comm is 9861

$cat /etc/pbs.conf
PBS_SERVER=thirteen
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=1
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp

$cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
10.9.9.13 thirteen

$ifconfig

enp28s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.9.9.13 netmask 255.255.255.0 broadcast 10.9.9.255
inet6 fe80::a3f9:6640:2012:c4ae prefixlen 64 scopeid 0x20
ether e8:61:1f:29:90:66 txqueuelen 1000 (Ethernet)
RX packets 69849 bytes 24324069 (23.1 MiB)
RX errors 0 dropped 2785 overruns 0 frame 0
TX packets 4643 bytes 1959986 (1.8 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device memory 0xaae00000-aae1ffff

$hostname -f
thirteen

$sestatus
SELinux status: disabled

$nmap thirteen
Starting Nmap 6.40 ( http://nmap.org ) at 2020-11-20 08:53 CST
Nmap scan report for thirteen (10.9.9.13)
Host is up (0.0000050s latency).
Not shown: 996 closed ports
PORT STATE SERVICE
22/tcp open ssh
111/tcp open rpcbind
15002/tcp open unknown
15003/tcp open unknown

Nmap done: 1 IP address (1 host up) scanned in 0.12 seconds

$firewall-cmd --state
not running

$pbs_hostn -v thirteen
primary name: thirteen (from gethostbyname())
aliases: -none-
address length: 4 bytes
address: 10.9.9.13 (218695946 dec) name: thirteen

$qstat -a
thirteen:
                                                  Req'd  Req'd   Elap
Job ID      Username Queue Jobname SessID NDS TSK Memory Time  S Time
8.thirteen  achen    workq STDIN       --   1   1     --    -- Q    --
9.thirteen  achen    workq STDIN       --   1   1     --    -- Q    --

$ps -ef | grep pbs_
root 9861 1 0 08:42 ? 00:00:00 /opt/pbs/sbin/pbs_comm
root 9871 1 0 08:42 ? 00:00:00 /opt/pbs/sbin/pbs_mom
root 9884 1 0 08:42 ? 00:00:00 /opt/pbs/sbin/pbs_sched
root 9966 1 0 08:42 ? 00:00:00 /opt/pbs/sbin/pbs_ds_monitor monitor
postgres 10063 10006 0 08:42 ? 00:00:00 postgres: postgres pbs_datastore 10.9.9.13(49620) idle
root 10064 1 0 08:42 ? 00:00:00 /opt/pbs/sbin/pbs_server.bin
root 11208 9398 0 08:55 pts/2 00:00:00 grep --color=auto pbs_

$ systemctl status pbs

  • pbs.service - Portable Batch System
    Loaded: loaded (/opt/pbs/libexec/pbs_init.d; enabled; vendor preset: disabled)
    Active: inactive (dead) since Fri 2020-11-20 08:28:12 CST; 28min ago
    Docs: man:pbs(8)
    Process: 7479 ExecStop=/opt/pbs/libexec/pbs_init.d stop (code=exited, status=0/SUCCESS)
    Process: 1657 ExecStart=/opt/pbs/libexec/pbs_init.d start (code=exited, status=0/SUCCESS)

Nov 20 07:37:03 thirteen su[2466]: (to postgres) root on none
Nov 20 07:37:06 thirteen su[2520]: (to postgres) root on none
Nov 20 07:37:17 thirteen pbs_init.d[1657]: Starting PBS in background
Nov 20 08:28:06 thirteen su[7229]: (to postgres) root on none
Nov 20 08:28:07 thirteen su[7262]: (to postgres) root on none
Nov 20 08:28:10 thirteen su[7298]: (to postgres) root on none
Nov 20 08:28:10 thirteen pbs_init.d[7479]: Stopping PBS
Nov 20 08:28:11 thirteen su[7539]: (to postgres) root on none
Nov 20 08:28:11 thirteen su[7577]: (to postgres) root on none
Nov 20 08:28:11 thirteen pbs_init.d[7479]: Waiting for shutdown to complete

$pbsnodes -av
thirteen
Mom = thirteen
Port = 15002
pbs_version = 20.0.0
ntype = PBS
state = free
pcpus = 48
resources_available.arch = linux
resources_available.host = thirteen
resources_available.mem = 329653792kb
resources_available.ncpus = 48
resources_available.vnode = thirteen
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
license = l
last_state_change_time = Fri Nov 20 08:56:41 2020

$ qstat -answ1

thirteen:
                                                  Req'd  Req'd   Elap
Job ID      Username Queue Jobname SessID NDS TSK Memory Time  S Time
8.thirteen  achen    workq STDIN       --   1   1     --    -- Q    -- --
9.thirteen  achen    workq STDIN       --   1   1     --    -- Q    -- --

$ ping thirteen
PING thirteen (10.9.9.13) 56(84) bytes of data.
64 bytes from thirteen (10.9.9.13): icmp_seq=1 ttl=64 time=0.020 ms
64 bytes from thirteen (10.9.9.13): icmp_seq=2 ttl=64 time=0.019 ms

$nslookup thirteen
Server: 8.8.8.8
Address: 8.8.8.8#53

** server can’t find thirteen: NXDOMAIN

$cat /var/spool/pbs/sched_logs/20201120

11/20/2020 08:56:35;0002;pbs_sched;Svr;Log;Log opened
11/20/2020 08:56:35;0002;pbs_sched;Svr;pbs_sched;pbs_version=20.0.0
11/20/2020 08:56:35;0002;pbs_sched;Svr;pbs_sched;pbs_build=mach=N/A:security=N/A:configure_args=N/A
11/20/2020 08:56:35;0002;pbs_sched;Svr;pbs_sched;hostname=thirteen;pbs_leaf_name=N/A;pbs_mom_node_name=N/A
11/20/2020 08:56:35;0002;pbs_sched;Svr;pbs_sched;ipv4 interface lo: localhost4.localdomain4
11/20/2020 08:56:35;0002;pbs_sched;Svr;pbs_sched;ipv4 interface enp28s0: thirteen
11/20/2020 08:56:35;0002;pbs_sched;Svr;pbs_sched;ipv6 interface lo: localhost6.localdomain6
11/20/2020 08:56:35;0002;pbs_sched;Svr;pbs_sched;ipv6 interface enp28s0: thirteen
11/20/2020 08:56:35;0002;pbs_sched;n/a;setup_env;read environment from /var/spool/pbs/pbs_environment
11/20/2020 08:56:35;0006;pbs_sched;Fil;pbs_sched;Version 20.0.0, started, initialization type = 0
11/20/2020 08:56:35;0002;pbs_sched;Svr;main;/opt/pbs/sbin/pbs_sched startup pid 11644
11/20/2020 08:56:35;0040;pbs_sched;Fil;sched_config;Error reading line 398:
11/20/2020 08:56:35;0040;pbs_sched;Fil;fairshare usage;Creating usage database for fairshare
11/20/2020 08:56:35;0080;pbs_sched;Req;;Launching 24 worker threads
11/20/2020 08:56:39;0001;pbs_sched;Svr;pbs_sched;Access from host not allowed, or unknown host (15008) in connect_svrpool, Couldn’t register the scheduler default with the configured servers
11/20/2020 08:56:41;0001;pbs_sched;Svr;pbs_sched;Access from host not allowed, or unknown host (15008) in connect_svrpool, Couldn’t register the scheduler default with the configured servers
11/20/2020 08:56:43;0001;pbs_sched;Svr;pbs_sched;Access from host not allowed, or unknown host (15008) in connect_svrpool, Couldn’t register the scheduler default with the configured servers
11/20/2020 08:56:46;0001;pbs_sched;Svr;pbs_sched;Access from host not allowed, or unknown host (15008) in connect_svrpool, Couldn’t register the scheduler default with the configured servers
11/20/2020 08:56:48;0001;pbs_sched;Svr;pbs_sched;Access from host not allowed, or unknown host (15008) in connect_svrpool, Couldn’t register the scheduler default with the configured servers
11/20/2020 08:56:50;0001;pbs_sched;Svr;pbs_sched;Access from host not allowed, or unknown host (15008) in connect_svrpool, Couldn’t register the scheduler default with the configured servers
11/20/2020 08:56:52;0001;pbs_sched;Svr;pbs_sched;Access from host not allowed, or unknown host (15008) in connect_svrpool, Couldn’t register the scheduler default with the configured servers
11/20/2020 08:56:54;0001;pbs_sched;Svr;pbs_sched;Access from host not allowed, or unknown host (15008) in connect_svrpool, Couldn’t register the scheduler default with the configured servers
11/20/2020 08:56:56;0001;pbs_sched;Svr;pbs_sched;Access from host not allowed, or unknown host (15008) in connect_svrpool, Couldn’t register the scheduler default with the configured servers
11/20/2020 08:56:58;0001;pbs_sched;Svr;pbs_sched;Access from host not allowed, or unknown host (15008) in connect_svrpool, Couldn’t register the scheduler default with the configured servers

By the way, I have to submit jobs from an account other than root. Submitting jobs as root fails with a BAD UID error.

I hope someone can help. Please let me know if I should post further details here.

It looks like some recent changes may be causing problems. Did you build from the master branch? It can be a bit unstable at times. It is best to pull from one of the tags here: https://github.com/openpbs/openpbs/tags

The tags represent stable release points.
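
For example, a release tag can be fetched directly instead of master. This is a minimal sketch, not an official procedure: the tag name `v20.0.1` and the target directory name are assumptions (check the real tag names on the tags page or with `git tag -l`), and the `run` helper with its `DRY_RUN` switch is a hypothetical convenience so that the command is only printed by default.

```shell
#!/bin/sh
# Sketch: fetch a tagged OpenPBS release rather than the master branch.
# Assumptions: the tag name and the target directory are illustrative.
REPO_URL="https://github.com/openpbs/openpbs.git"
TAG="${TAG:-v20.0.1}"        # assumed tag name; verify it on the tags page
DRY_RUN="${DRY_RUN:-1}"      # 1 = only print the command, 0 = execute it

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Shallow-clone only the chosen tag (git accepts a tag name for --branch).
fetch_release() {
    run git clone --branch "$TAG" --depth 1 "$REPO_URL" "openpbs-$TAG"
}

fetch_release
```

Set `DRY_RUN=0` to actually clone, then build from the resulting directory with the same INSTALL steps as before.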

Thanks @mkaro for your kind reply.

I tried version 20.0.1 and ran into two further problems.
First, how can I uninstall the older version of OpenPBS that was installed from source? I simply installed version 20.0.1 over it, and now the pbs service will not start. Below is the message:
$/etc/init.d/pbs start
Starting PBS
PBS Home directory /var/spool/pbs needs updating.
Running /opt/pbs/libexec/pbs_habitat to update it.


Cannot upgrade PBS datastore version 1.5.0
Failed to upgrade PBS Datastore

I guess it’s because I didn’t uninstall the older version. Could you please give me some advice?

Second, I tried to install version 20.0.1 on another server. Following exactly the same instructions, I ran into a difficult problem:
$ make

libtool: link: gcc -g -O2 -o pbs_server.bin pbs_server_bin-accounting.o pbs_server_bin-array_func.o pbs_server_bin-attr_recov.o pbs_server_bin-attr_recov_db.o pbs_server_bin-checkkey.o pbs_server_bin-dis_read.o pbs_server_bin-failover.o pbs_server_bin-geteusernam.o pbs_server_bin-hook_func.o pbs_server_bin-issue_request.o pbs_server_bin-job_func.o pbs_server_bin-job_recov.o pbs_server_bin-job_recov_db.o pbs_server_bin-job_route.o pbs_server_bin-license_client.o pbs_server_bin-mom_info.o pbs_server_bin-node_func.o pbs_server_bin-node_manager.o pbs_server_bin-node_recov_db.o pbs_server_bin-pbsd_init.o pbs_server_bin-pbsd_main.o pbs_server_bin-process_request.o pbs_server_bin-queue_func.o pbs_server_bin-queue_recov_db.o pbs_server_bin-reply_send.o pbs_server_bin-req_delete.o pbs_server_bin-req_getcred.o pbs_server_bin-req_holdjob.o pbs_server_bin-req_jobobit.o pbs_server_bin-req_locate.o pbs_server_bin-req_manager.o pbs_server_bin-req_message.o pbs_server_bin-req_modify.o pbs_server_bin-req_preemptjob.o pbs_server_bin-req_movejob.o pbs_server_bin-req_quejob.o pbs_server_bin-req_register.o pbs_server_bin-req_rerun.o pbs_server_bin-req_rescq.o pbs_server_bin-req_runjob.o pbs_server_bin-req_select.o pbs_server_bin-req_shutdown.o pbs_server_bin-req_signal.o pbs_server_bin-req_stat.o pbs_server_bin-req_track.o pbs_server_bin-req_cred.o pbs_server_bin-resc_attr.o pbs_server_bin-resv_attr.o pbs_server_bin-run_sched.o pbs_server_bin-sched_func.o pbs_server_bin-setup_resc.o pbs_server_bin-stat_job.o pbs_server_bin-svr_attr.o pbs_server_bin-svr_chk_owner.o pbs_server_bin-svr_connect.o pbs_server_bin-svr_func.o pbs_server_bin-svr_jobfunc.o pbs_server_bin-svr_mail.o pbs_server_bin-svr_migrate_data.o pbs_server_bin-svr_movejob.o pbs_server_bin-svr_recov_db.o pbs_server_bin-svr_resccost.o pbs_server_bin-svr_credfunc.o pbs_server_bin-user_func.o pbs_server_bin-vnparse.o …/…/src/lib/Libtpp/libtpp.a …/…/src/lib/Libattr/libattr.a …/…/src/lib/Libutil/libutil.a 
…/…/src/lib/Liblog/liblog.a …/…/src/lib/Libnet/libnet.a …/…/src/lib/Libsec/libsec.a …/…/src/lib/Libsite/libsite.a …/…/src/lib/Libpython/libpbspython_svr.a …/…/src/lib/Libdb/libdb.a …/…/src/lib/Libpbs/.libs/libpbs.a -lpq /home/achen/anaconda3/lib/libexpat.so -L/home/achen/anaconda3/lib -lz -lical -L/home/achen/anaconda3/lib/python3.7/config-3.7m-x86_64-linux-gnu -lpython3.7m -lpthread -lutil -lrt -lm -lssl -lcrypto -ldl -lcrypt -lc -Wl,-rpath -Wl,/home/achen/anaconda3/lib -Wl,-rpath -Wl,/home/achen/anaconda3/lib
/usr/bin/ld: warning: libssl.so.10, needed by /usr/lib/gcc/x86_64-redhat-linux/4.8.5/…/…/…/…/lib64/libpq.so, may conflict with libssl.so.1.1
/usr/bin/ld: warning: libcrypto.so.10, needed by /usr/lib/gcc/x86_64-redhat-linux/4.8.5/…/…/…/…/lib64/libpq.so, may conflict with libcrypto.so.1.1
/usr/bin/ld: …/…/src/lib/Libpbs/.libs/libpbs.a(libpbs_la-pbs_aes_encrypt.o): undefined reference to symbol 'EVP_CIPHER_CTX_init@@libcrypto.so.10'
/usr/lib64/libcrypto.so.10: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
make[2]: *** [pbs_server.bin] Error 1
make[2]: Leaving directory `/home/achen/openpbs-20.0.1/src/server'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/achen/openpbs-20.0.1/src'
make: *** [all-recursive] Error 1

I hope you can give me some advice. Thanks a lot!

Could you please give me some advice when you have time? :sob: Thanks~

If this is a new installation, and you don’t need to preserve anything from previous installations, the easiest thing to do is to stop PBS and then remove /var/spool/pbs completely. Upon restart, PBS will run the pbs_habitat script to recreate the necessary directories and files. I think the reason you are seeing a problem is that the DB upgrade script only works when migrating from an older version of the database to a newer one, and you are doing the opposite by going back to 20.0.1.
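
The reset described above might look like the following. It is a cautious sketch rather than a definitive procedure: it moves the old directory aside instead of deleting it outright (an assumption made here for safety; the suggestion above is to remove it), it takes the /etc/init.d/pbs script and the /var/spool/pbs path from the output earlier in this thread, and the `run` helper with `DRY_RUN` is hypothetical. It has to run as root, and all queued jobs and server settings are lost.

```shell
#!/bin/sh
# Sketch: start over with a fresh PBS_HOME on a new installation.
PBS_HOME="${PBS_HOME:-/var/spool/pbs}"    # PBS_HOME value from /etc/pbs.conf in this thread
PBS_INIT="${PBS_INIT:-/etc/init.d/pbs}"   # init script used earlier in this thread
DRY_RUN="${DRY_RUN:-1}"                   # 1 = only print commands, 0 = execute them

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

reset_pbs_home() {
    run "$PBS_INIT" stop
    # Assumption: keep the old tree as a backup; delete it once PBS works again.
    run mv "$PBS_HOME" "$PBS_HOME.old"
    # On start, PBS runs pbs_habitat to recreate the directory tree.
    run "$PBS_INIT" start
}

reset_pbs_home
```

Review the printed commands first, then rerun with `DRY_RUN=0` as root.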

If you have settings or jobs you need to preserve, my suggestion is to drain the queues and to capture the output of: qmgr -c "print server" > /tmp/my_settings

This can then be fed back into qmgr when the 20.0.1 instance is started: cat /tmp/my_settings | qmgr
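
Put together, the save and restore steps could be wrapped as below. This is only a sketch: the function names, the `SETTINGS_FILE` variable, and the non-empty check are illustrative additions, not part of PBS. Only `qmgr -c "print server"` and feeding the file back into `qmgr` come from the suggestion above.

```shell
#!/bin/sh
# Sketch: save the server configuration before reinstalling, restore it after.
SETTINGS_FILE="${SETTINGS_FILE:-/tmp/my_settings}"

# Dump the current configuration as a list of qmgr commands.
save_settings() {
    qmgr -c "print server" > "$SETTINGS_FILE" || return 1
    # Illustrative safety check: refuse to continue with an empty dump.
    [ -s "$SETTINGS_FILE" ] || { echo "empty dump: $SETTINGS_FILE" >&2; return 1; }
}

# Replay the saved commands into the freshly started server.
restore_settings() {
    qmgr < "$SETTINGS_FILE"
}
```

Call `save_settings` before stopping the old instance and `restore_settings` once the 20.0.1 instance is running.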