Hi all,
I just installed OpenPBS on a single server, following the INSTALL instructions on GitHub. Everything looks fine except that all jobs are stuck in the Q state. I searched around but could not find a fix. My guess is that the scheduler is not working properly. Below is the output that may help track down the cause. Thank you for taking the time to help.
$ /etc/init.d/pbs status
pbs_server is pid 10064
pbs_mom is pid 9871
pbs_sched is pid 9884
pbs_comm is 9861
$cat /etc/pbs.conf
PBS_SERVER=thirteen
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=1
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp
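For anyone comparing this against their own setup, here is a minimal sketch that lists which daemons a pbs.conf-style file enables through its PBS_START_* flags. The function name enabled_pbs_daemons is hypothetical (it is not part of PBS), and the sketch assumes the plain KEY=VALUE layout shown above.

```shell
# List the daemons enabled in a pbs.conf-style file, one per line.
# Assumption: each flag is written exactly as PBS_START_<NAME>=1, as in the file above.
# enabled_pbs_daemons is a hypothetical helper name, not a PBS command.
enabled_pbs_daemons() {
    sed -n 's/^PBS_START_\([A-Z_]*\)=1$/\1/p' "$1" | tr 'A-Z' 'a-z'
}

# Usage with the path from this post:
# enabled_pbs_daemons /etc/pbs.conf
```

With the file above this should list server, sched, comm and mom, which matches the four daemons reported by the status command.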
$cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
10.9.9.13 thirteen
$ifconfig
…
enp28s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.9.9.13 netmask 255.255.255.0 broadcast 10.9.9.255
inet6 fe80::a3f9:6640:2012:c4ae prefixlen 64 scopeid 0x20
ether e8:61:1f:29:90:66 txqueuelen 1000 (Ethernet)
RX packets 69849 bytes 24324069 (23.1 MiB)
RX errors 0 dropped 2785 overruns 0 frame 0
TX packets 4643 bytes 1959986 (1.8 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device memory 0xaae00000-aae1ffff
…
$hostname -f
thirteen
$ sestatus
SELinux status: disabled
$ nmap thirteen
Starting Nmap 6.40 ( http://nmap.org ) at 2020-11-20 08:53 CST
Nmap scan report for thirteen (10.9.9.13)
Host is up (0.0000050s latency).
Not shown: 996 closed ports
PORT STATE SERVICE
22/tcp open ssh
111/tcp open rpcbind
15002/tcp open unknown
15003/tcp open unknown
Nmap done: 1 IP address (1 host up) scanned in 0.12 seconds
$firewall-cmd --state
not running
$pbs_hostn -v thirteen
primary name: thirteen (from gethostbyname())
aliases: -none-
address length: 4 bytes
address: 10.9.9.13 (218695946 dec) name: thirteen
$qstat -a
thirteen:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
8.thirteen      achen    workq    STDIN         --    1   1    --     -- Q   --
9.thirteen      achen    workq    STDIN         --    1   1    --     -- Q   --
…
$ps -ef | grep pbs_
root 9861 1 0 08:42 ? 00:00:00 /opt/pbs/sbin/pbs_comm
root 9871 1 0 08:42 ? 00:00:00 /opt/pbs/sbin/pbs_mom
root 9884 1 0 08:42 ? 00:00:00 /opt/pbs/sbin/pbs_sched
root 9966 1 0 08:42 ? 00:00:00 /opt/pbs/sbin/pbs_ds_monitor monitor
postgres 10063 10006 0 08:42 ? 00:00:00 postgres: postgres pbs_datastore 10.9.9.13(49620) idle
root 10064 1 0 08:42 ? 00:00:00 /opt/pbs/sbin/pbs_server.bin
root 11208 9398 0 08:55 pts/2 00:00:00 grep --color=auto pbs_
$ systemctl status pbs
- pbs.service - Portable Batch System
Loaded: loaded (/opt/pbs/libexec/pbs_init.d; enabled; vendor preset: disabled)
Active: inactive (dead) since Fri 2020-11-20 08:28:12 CST; 28min ago
Docs: man:pbs(8)
Process: 7479 ExecStop=/opt/pbs/libexec/pbs_init.d stop (code=exited, status=0/SUCCESS)
Process: 1657 ExecStart=/opt/pbs/libexec/pbs_init.d start (code=exited, status=0/SUCCESS)
Nov 20 07:37:03 thirteen su[2466]: (to postgres) root on none
Nov 20 07:37:06 thirteen su[2520]: (to postgres) root on none
Nov 20 07:37:17 thirteen pbs_init.d[1657]: Starting PBS in background
Nov 20 08:28:06 thirteen su[7229]: (to postgres) root on none
Nov 20 08:28:07 thirteen su[7262]: (to postgres) root on none
Nov 20 08:28:10 thirteen su[7298]: (to postgres) root on none
Nov 20 08:28:10 thirteen pbs_init.d[7479]: Stopping PBS
Nov 20 08:28:11 thirteen su[7539]: (to postgres) root on none
Nov 20 08:28:11 thirteen su[7577]: (to postgres) root on none
Nov 20 08:28:11 thirteen pbs_init.d[7479]: Waiting for shutdown to complete
$pbsnodes -av
thirteen
Mom = thirteen
Port = 15002
pbs_version = 20.0.0
ntype = PBS
state = free
pcpus = 48
resources_available.arch = linux
resources_available.host = thirteen
resources_available.mem = 329653792kb
resources_available.ncpus = 48
resources_available.vnode = thirteen
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
license = l
last_state_change_time = Fri Nov 20 08:56:41 2020
$ qstat -answ1
thirteen:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
8.thirteen      achen    workq    STDIN         --    1   1    --     -- Q   --  --
   --
9.thirteen      achen    workq    STDIN         --    1   1    --     -- Q   --  --
$ ping thirteen
PING thirteen (10.9.9.13) 56(84) bytes of data.
64 bytes from thirteen (10.9.9.13): icmp_seq=1 ttl=64 time=0.020 ms
64 bytes from thirteen (10.9.9.13): icmp_seq=2 ttl=64 time=0.019 ms
…
$nslookup thirteen
Server: 8.8.8.8
Address: 8.8.8.8#53
** server can't find thirteen: NXDOMAIN
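A note on the output above: nslookup queries the configured DNS server directly and does not read /etc/hosts, so NXDOMAIN here only shows that the name is not in DNS. To see what the system resolver itself returns, a minimal sketch using getent could look like this. The function name resolve_via_nss is hypothetical, and the sketch assumes a Linux system where getent is available.

```shell
# Print the first address the system resolver (NSS, which consults
# /etc/hosts and/or DNS depending on nsswitch.conf) returns for a name.
# resolve_via_nss is a hypothetical helper name.
resolve_via_nss() {
    getent hosts "$1" | awk 'NR == 1 { print $1 }'
}

# Usage with the hostname from this post:
# resolve_via_nss thirteen
```

An empty result would mean the system resolver cannot resolve the name either.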
$cat /var/spool/pbs/sched_logs/20201120
…
11/20/2020 08:56:35;0002;pbs_sched;Svr;Log;Log opened
11/20/2020 08:56:35;0002;pbs_sched;Svr;pbs_sched;pbs_version=20.0.0
11/20/2020 08:56:35;0002;pbs_sched;Svr;pbs_sched;pbs_build=mach=N/A:security=N/A:configure_args=N/A
11/20/2020 08:56:35;0002;pbs_sched;Svr;pbs_sched;hostname=thirteen;pbs_leaf_name=N/A;pbs_mom_node_name=N/A
11/20/2020 08:56:35;0002;pbs_sched;Svr;pbs_sched;ipv4 interface lo: localhost4.localdomain4
11/20/2020 08:56:35;0002;pbs_sched;Svr;pbs_sched;ipv4 interface enp28s0: thirteen
11/20/2020 08:56:35;0002;pbs_sched;Svr;pbs_sched;ipv6 interface lo: localhost6.localdomain6
11/20/2020 08:56:35;0002;pbs_sched;Svr;pbs_sched;ipv6 interface enp28s0: thirteen
11/20/2020 08:56:35;0002;pbs_sched;n/a;setup_env;read environment from /var/spool/pbs/pbs_environment
11/20/2020 08:56:35;0006;pbs_sched;Fil;pbs_sched;Version 20.0.0, started, initialization type = 0
11/20/2020 08:56:35;0002;pbs_sched;Svr;main;/opt/pbs/sbin/pbs_sched startup pid 11644
11/20/2020 08:56:35;0040;pbs_sched;Fil;sched_config;Error reading line 398:
11/20/2020 08:56:35;0040;pbs_sched;Fil;fairshare usage;Creating usage database for fairshare
11/20/2020 08:56:35;0080;pbs_sched;Req;;Launching 24 worker threads
11/20/2020 08:56:39;0001;pbs_sched;Svr;pbs_sched;Access from host not allowed, or unknown host (15008) in connect_svrpool, Couldn't register the scheduler default with the configured servers
11/20/2020 08:56:41;0001;pbs_sched;Svr;pbs_sched;Access from host not allowed, or unknown host (15008) in connect_svrpool, Couldn't register the scheduler default with the configured servers
11/20/2020 08:56:43;0001;pbs_sched;Svr;pbs_sched;Access from host not allowed, or unknown host (15008) in connect_svrpool, Couldn't register the scheduler default with the configured servers
11/20/2020 08:56:46;0001;pbs_sched;Svr;pbs_sched;Access from host not allowed, or unknown host (15008) in connect_svrpool, Couldn't register the scheduler default with the configured servers
11/20/2020 08:56:48;0001;pbs_sched;Svr;pbs_sched;Access from host not allowed, or unknown host (15008) in connect_svrpool, Couldn't register the scheduler default with the configured servers
11/20/2020 08:56:50;0001;pbs_sched;Svr;pbs_sched;Access from host not allowed, or unknown host (15008) in connect_svrpool, Couldn't register the scheduler default with the configured servers
11/20/2020 08:56:52;0001;pbs_sched;Svr;pbs_sched;Access from host not allowed, or unknown host (15008) in connect_svrpool, Couldn't register the scheduler default with the configured servers
11/20/2020 08:56:54;0001;pbs_sched;Svr;pbs_sched;Access from host not allowed, or unknown host (15008) in connect_svrpool, Couldn't register the scheduler default with the configured servers
11/20/2020 08:56:56;0001;pbs_sched;Svr;pbs_sched;Access from host not allowed, or unknown host (15008) in connect_svrpool, Couldn't register the scheduler default with the configured servers
11/20/2020 08:56:58;0001;pbs_sched;Svr;pbs_sched;Access from host not allowed, or unknown host (15008) in connect_svrpool, Couldn't register the scheduler default with the configured servers
…
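Because the scheduler log repeats the same error every couple of seconds, it can help to collapse it into distinct messages with counts. This is a minimal sketch, assuming the semicolon-separated layout seen above with the timestamp as the first field. The function name summarize_sched_log is hypothetical and not part of PBS.

```shell
# Count distinct log messages, ignoring the leading timestamp field.
# Assumption: every line starts with "timestamp;" followed by the rest of
# the record, as in the scheduler log above.
# summarize_sched_log is a hypothetical helper name.
summarize_sched_log() {
    cut -d';' -f2- | sort | uniq -c | sort -rn
}

# Usage with the log path from this post:
# summarize_sched_log < /var/spool/pbs/sched_logs/20201120
```

The most frequent message comes first, so a repeated registration error like the one above should end up at the top of the summary.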
By the way, I have to submit jobs from a non-root account. Submitting jobs as root fails with a BAD UID error.
I hope someone can help. Please let me know if I should post any further details.