Job stuck in queue after fresh install | Permission error 15008

Dear all,

I’m facing a problem similar to the one described in "Job gets stuck in a queue after a fresh install". However, the advice given there did not resolve it, and the behaviour differs in some details.

I have installed PBS (master branch from GitHub, the version passing the CI tests on Travis) on Ubuntu 20.04 using the commands listed in INSTALL. The intended use is to run PBS on a local machine, both to debug scripts before submitting them to a cluster and to schedule jobs on that same machine.

I was able to configure users, queues, etc. The behaviour is quite similar to the one in "Job gets stuck in a queue after a fresh install", with one difference: the log files suggest that PBS is unable to authenticate (error 15008). Unfortunately, I was unable to resolve this issue.

badin@fermi:~$ sudo /etc/init.d/pbs start
Starting PBS
/opt/pbs/sbin/pbs_comm ready (pid=457933), Proxy Name:fermi:17001, Threads:4
PBS comm
PBS mom
PBS sched
pgrep: cannot allocate 4611686018427387903 bytes
Connecting to PBS dataservice...connected to PBS dataservice@fermi
Licenses valid for 1000000 Floating hosts
PBS server

Everything seems to work …

badin@fermi:~$ sudo /etc/init.d/pbs status
pbs_server is pid 458087
pbs_mom is pid 457943
pbs_sched is pid 457955
pbs_comm is 457933

However, after I submit a simple job, it stays in the queue …

badin@fermi:~/...$ qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
1012.fermi       job_name          badin                    0 Q batch 

The server, scheduler and communicator sometimes go down and sometimes stay up, but in either case the job stays in the queue.

server_logs:

02/01/2021 15:10:41;0002;Server@fermi;Svr;Log;Log opened
02/01/2021 15:10:41;0002;Server@fermi;Svr;Server@fermi;pbs_version=20.0.0
02/01/2021 15:10:41;0002;Server@fermi;Svr;Server@fermi;pbs_build=mach=N/A:security=N/A:configure_args=N/A
02/01/2021 15:10:41;0002;Server@fermi;Svr;Server@fermi;hostname=fermi;pbs_leaf_name=N/A;pbs_mom_node_name=N/A
02/01/2021 15:10:41;0002;Server@fermi;Svr;Server@fermi;ipv4 interface lo: localhost 
02/01/2021 15:10:41;0002;Server@fermi;Svr;Server@fermi;ipv4 interface enp4s0: fermi 
02/01/2021 15:10:41;0002;Server@fermi;Svr;Server@fermi;ipv6 interface lo: ip6-loopback 
02/01/2021 15:10:41;0002;Server@fermi;Svr;Server@fermi;ipv6 interface enp4s0: fermi 
02/01/2021 15:10:41;0006;Server@fermi;Fil;Server@fermi;Version 20.0.0, started, initialization type = 1
02/01/2021 15:10:41;0002;Server@fermi;Svr;Server@fermi;pbs_status_db exit code 1
02/01/2021 15:10:41;0002;Server@fermi;Svr;Server@fermi;Starting PBS dataservice
02/01/2021 15:10:44;0002;Server@fermi;Svr;Server@fermi;connected to PBS dataservice@fermi
02/01/2021 15:10:44;0086;Server@fermi;Svr;pbs_python_ext_quick_start_interpreter;--> Python Interpreter quick started, compiled with version:'3.8.5 (default, Jul 28 2020, 12:59:40) 
[GCC 9.3.0]' <--
02/01/2021 15:10:44;0086;Server@fermi;Svr;pbs_python_ext_quick_start_interpreter;--> Inserted Altair PBS Python modules dir '/opt/pbs/lib/python/altair' '/opt/pbs/lib/python/altair/pbs/v1'<--
02/01/2021 15:10:44;0002;Server@fermi;n/a;setup_env;read environment from /var/spool/pbs/pbs_environment
02/01/2021 15:10:44;0000;Server@fermi;Svr;Server@fermi;Supported authentication method: resvport
02/01/2021 15:10:44;0004;Server@fermi;Svr;Server@fermi;node_fail_requeue value changed to 310
02/01/2021 15:10:44;0004;Server@fermi;Svr;Server@fermi;svr_max_job_sequence_id set to val 9999999
02/01/2021 15:10:44;0004;Server@fermi;Req;default;'throughput_mode' is being deprecated, it is recommended to use 'job_run_wait'
02/01/2021 15:10:44;0004;Server@fermi;Svr;Server@fermi;Licenses valid for 1000000 Floating hosts
02/01/2021 15:10:44;0002;Server@fermi;Svr;Act;Account file /var/spool/p
2/01/2021 15:10:44;0086;Server@fermi;Svr;Server@fermi;Recovered queue batch
02/01/2021 15:10:44;0086;Server@fermi;Svr;Server@fermi;Recovered queue prior
02/01/2021 15:10:44;0100;Server@fermi;Job;1012.fermi;enqueuing into batch, state Q hop 1
02/01/2021 15:10:44;0086;Server@fermi;Job;1012.fermi;Requeueing job, substate: 10 Requeued in queue: batch
02/01/2021 15:10:44;0080;Server@fermi;Svr;Server@fermi;No jobs to open
02/01/2021 15:10:44;0002;Server@fermi;Svr;Server@fermi;Recovered 1 jobs
02/01/2021 15:10:44;0086;Server@fermi;Svr;Server@fermi;Found hook PBS_cray_atom type=pbs
02/01/2021 15:10:44;0086;Server@fermi;Svr;Server@fermi;Found hook PBS_power type=pbs
02/01/2021 15:10:44;0086;Server@fermi;Svr;Server@fermi;Found hook PBS_alps_inventory_check type=pbs
02/01/2021 15:10:44;0086;Server@fermi;Svr;Server@fermi;Found hook pbs_cgroups type=site
02/01/2021 15:10:44;0080;Server@fermi;Hook;print_hook;ALLHOOKS hook[0] = {PBS_cray_atom, order=100, type=1, enabled=0 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_end), alarm=300, freq=120}
02/01/2021 15:10:44;0080;Server@fermi;Hook;print_hook;ALLHOOKS hook[1] = {PBS_power, order=2000, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(periodic,execjob_begin,execjob_prologue,execjob_epilogue,execjob_end,exechost_periodic,exechost_startup), alarm=180, freq=300}
02/01/2021 15:10:44;0080;Server@fermi;Hook;print_hook;ALLHOOKS hook[2] = {PBS_alps_inventory_check, order=1, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(exechost_periodic), alarm=90, freq=300}
02/01/2021 15:10:44;0080;Server@fermi;Hook;print_hook;ALLHOOKS hook[3] = {pbs_cgroups, order=100, type=0, enabled=0 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,execjob_resize,execjob_abort,execjob_postsuspend,execjob_preresume,exechost_periodic,exechost_startup), alarm=90, freq=120}
02/01/2021 15:10:44;0080;Server@fermi;Hook;print_hook;periodic hook[0] = {PBS_power, order=2000, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(periodic,execjob_begin,execjob_prologue,execjob_epilogue,execjob_end,exechost_periodic,exechost_startup), alarm=180, freq=300}
02/01/2021 15:10:44;0086;Server@fermi;Svr;pbs_python_ext_quick_shutdown_interpreter;--> Stopping Python interpreter <--
02/01/2021 15:10:44;0d80;Server@fermi;TPP;Server@fermi(Main Thread);TPP authentication method = resvport
02/01/2021 15:10:44;0c06;Server@fermi;TPP;Server@fermi(Main Thread);TPP leaf node names = 192.168.0.100:15001,127.0.0.1:15001,192.168.0.100:15001
02/01/2021 15:10:44;0d80;Server@fermi;TPP;Server@fermi(Main Thread);Initializing TPP transport Layer
02/01/2021 15:10:44;0d80;Server@fermi;TPP;Server@fermi(Main Thread);Max files allowed = 16384
02/01/2021 15:10:44;0d80;Server@fermi;TPP;Server@fermi(Main Thread);TPP initialization done
02/01/2021 15:10:44;0d80;Server@fermi;TPP;Server@fermi(Main Thread);Connecting to pbs_comm fermi:17001
02/01/2021 15:10:44;0002;Server@fermi;Svr;Server@fermi;Server pid = 471827 ready;  using ports Server:15001 MOM:15002 RM:15003
02/01/2021 15:10:44;0c06;Server@fermi;TPP;Server@fermi(Thread 0);Thread ready
02/01/2021 15:10:44;0c06;Server@fermi;TPP;Server@fermi(Thread 0);Registering address 192.168.0.100:15001 to pbs_comm fermi:17001
02/01/2021 15:10:44;0c06;Server@fermi;TPP;Server@fermi(Thread 0);Connected to pbs_comm fermi:17001
02/01/2021 15:10:44;0106;Server@fermi;Svr;Server@fermi;BEGIN setting up all resource attributes 
02/01/2021 15:10:44;0106;Server@fermi;Svr;Server@fermi;DONE setting up all resource attributes, number set <51>
02/01/2021 15:10:44;0106;Server@fermi;Svr;Server@fermi;BEGIN setting up all queue attributes 
02/01/2021 15:10:44;0106;Server@fermi;Svr;Server@fermi;DONE setting up all queue attributes, number set <56>
02/01/2021 15:10:44;0106;Server@fermi;Svr;Server@fermi;BEGIN setting up all job attributes 
02/01/2021 15:10:44;0106;Server@fermi;Svr;Server@fermi;DONE setting up all job attributes, number set <110>
02/01/2021 15:10:44;0106;Server@fermi;Svr;Server@fermi;BEGIN setting up all server attributes 
02/01/2021 15:10:44;0106;Server@fermi;Svr;Server@fermi;DONE setting up all server attributes, number set <101>
02/01/2021 15:10:44;0106;Server@fermi;Svr;Server@fermi;BEGIN setting up all reservation attributes 
02/01/2021 15:10:44;0106;Server@fermi;Svr;Server@fermi;DONE setting up all reservation attributes, number set <48>
02/01/2021 15:10:44;0106;Server@fermi;Svr;Server@fermi;BEGIN setting up all vnode attributes 
02/01/2021 15:10:44;0106;Server@fermi;Svr;Server@fermi;DONE setting up all vnode attributes, number set <36>
02/01/2021 15:10:44;0080;Server@fermi;Svr;Server@fermi;successfully set up signal.default_int_handler
02/01/2021 15:10:44;0001;Server@fermi;Svr;net_restore_handler;net restore handler called
02/01/2021 15:10:45;0100;Server@fermi;Req;;Type 0 request received from root@localhost, sock=16
02/01/2021 15:10:45;0100;Server@fermi;Req;;Type 95 request received from root@localhost, sock=17
02/01/2021 15:10:45;0100;Server@fermi;Req;;Type 0 request received from root@localhost, sock=17
02/01/2021 15:10:45;0100;Server@fermi;Req;;Type 95 request received from root@localhost, sock=18
02/01/2021 15:10:45;0100;Server@fermi;Req;;Type 98 request received from root@localhost, sock=16

with the last message denoting error code 15008:

02/01/2021 15:10:45;00a0;Server@fermi;Req;req_reject;Reject reply code=15008, aux=0, type=98, from root@localhost
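
To pull such rejections out of a long server log, a short script may help. This is only a minimal sketch: it assumes each log record is one line of six semicolon-separated fields (timestamp, event code, daemon, object type, object name, message), which is what the excerpts above look like. The names `Reject` and `find_rejects` are made up for this example and are not part of PBS.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class Reject(NamedTuple):
    timestamp: str
    code: int
    req_type: int
    origin: str


# Matches the message text of a req_reject record as it appears in the log above.
REJECT_RE = re.compile(r"Reject reply code=(\d+), aux=\d+, type=(\d+), from (\S+)")


def find_rejects(lines: Iterable[str]) -> Iterator[Reject]:
    """Yield one Reject per 'Reject reply code=...' record."""
    for line in lines:
        # Assumed layout: timestamp;event;daemon;object type;object name;message
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) != 6:
            continue  # continuation lines or truncated records
        match = REJECT_RE.search(fields[5])
        if match:
            yield Reject(fields[0], int(match.group(1)),
                         int(match.group(2)), match.group(3))


sample = [
    "02/01/2021 15:10:45;0100;Server@fermi;Req;;Type 98 request received from root@localhost, sock=16",
    "02/01/2021 15:10:45;00a0;Server@fermi;Req;req_reject;Reject reply code=15008, aux=0, type=98, from root@localhost",
]
rejects = list(find_rejects(sample))
for r in rejects:
    print(r.timestamp, r.code, r.req_type, r.origin)
```

For a real log, the same function can be fed an open file object instead of the `sample` list.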

mom_logs:

02/01/2021 15:10:41;0002;pbs_mom;Svr;Log;Log opened
02/01/2021 15:10:41;0002;pbs_mom;Svr;pbs_mom;pbs_version=20.0.0
02/01/2021 15:10:41;0002;pbs_mom;Svr;pbs_mom;pbs_build=mach=N/A:security=N/A:configure_args=N/A
02/01/2021 15:10:41;0002;pbs_mom;Svr;pbs_mom;hostname=fermi;pbs_leaf_name=N/A;pbs_mom_node_name=N/A
02/01/2021 15:10:41;0002;pbs_mom;Svr;pbs_mom;ipv4 interface lo: localhost 
02/01/2021 15:10:41;0002;pbs_mom;Svr;pbs_mom;ipv4 interface enp4s0: fermi 
02/01/2021 15:10:41;0002;pbs_mom;Svr;pbs_mom;ipv6 interface lo: ip6-loopback 
02/01/2021 15:10:41;0002;pbs_mom;Svr;pbs_mom;ipv6 interface enp4s0: fermi 
02/01/2021 15:10:41;0100;pbs_mom;Svr;parse_config;file config
02/01/2021 15:10:41;0002;pbs_mom;Svr;pbs_mom;Adding IP address 127.0.1.1 as authorized
02/01/2021 15:10:41;0002;pbs_mom;Svr;pbs_mom;Adding IP address 192.168.0.100 as authorized
02/01/2021 15:10:41;0002;pbs_mom;n/a;set_restrict_user_maxsys;setting 999
02/01/2021 15:10:41;0002;pbs_mom;n/a;read_config;max_check_poll = 120, min_check_poll = 10
02/01/2021 15:10:41;0002;pbs_mom;Svr;pbs_mom;Adding IP address 127.0.0.1 as authorized
02/01/2021 15:10:41;0002;pbs_mom;Svr;set_checkpoint_path;Using default checkpoint path.
...
02/01/2021 15:10:41;0002;pbs_mom;n/a;ncpus;hyperthreading enabled
02/01/2021 15:10:41;0002;pbs_mom;n/a;initialize;pcpus=32, OS reports 32 cpu(s)
02/01/2021 15:10:41;0d80;pbs_mom;TPP;pbs_mom(Main Thread);TPP authentication method = resvport
02/01/2021 15:10:41;0c06;pbs_mom;TPP;pbs_mom(Main Thread);TPP leaf node names = 192.168.0.100:15003,127.0.0.1:15003,192.168.0.100:15003
02/01/2021 15:10:41;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Initializing TPP transport Layer
02/01/2021 15:10:41;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Max files allowed = 16384
02/01/2021 15:10:41;0d80;pbs_mom;TPP;pbs_mom(Main Thread);TPP initialization done
02/01/2021 15:10:41;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Connecting to pbs_comm fermi:17001
02/01/2021 15:10:41;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Thread ready
02/01/2021 15:10:41;0006;pbs_mom;Fil;pbs_mom;Version 20.0.0, started, initialization type = 0
02/01/2021 15:10:41;0002;pbs_mom;Svr;pbs_mom;Mom pid = 471684 ready, using ports Server:15001 MOM:15002 RM:15003
02/01/2021 15:10:41;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 192.168.0.100:15003 to pbs_comm fermi:17001
02/01/2021 15:10:41;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm fermi:17001
02/01/2021 15:10:41;0001;pbs_mom;Svr;net_restore_handler;net restore handler called
02/01/2021 15:10:43;0002;pbs_mom;Svr;pbs_mom;HELLO sent to server at fermi:15001, stream:0
02/01/2021 15:10:43;0001;pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr 192.168.0.100:15001 on stream 0
02/01/2021 15:10:43;0002;pbs_mom;Svr;im_eof;Server closed connection.
02/01/2021 15:10:47;0002;pbs_mom;Svr;pbs_mom;HELLO sent to server at fermi:15001, stream:1
02/01/2021 15:10:47;0002;pbs_mom;Svr;pbs_mom;ReplyHello from server at 192.168.0.100:15001
02/01/2021 15:15:28;0001;pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr 192.168.0.100:15001 on stream 1
02/01/2021 15:15:28;0002;pbs_mom;Svr;im_eof;Server closed connection.
02/01/2021 15:15:28;0002;pbs_mom;Svr;pbs_mom;HELLO sent to server at fermi:15001, stream:2
02/01/2021 15:15:28;0001;pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr 192.168.0.100:15001 on stream 2
02/01/2021 15:15:28;0002;pbs_mom;Svr;im_eof;Server closed connection.

sched_logs:

02/01/2021 15:40:15;0002;pbs_sched;Svr;Log;Log opened
02/01/2021 15:40:15;0002;pbs_sched;Svr;pbs_sched;pbs_version=20.0.0
02/01/2021 15:40:15;0002;pbs_sched;Svr;pbs_sched;pbs_build=mach=N/A:security=N/A:configure_args=N/A
02/01/2021 15:40:15;0002;pbs_sched;Svr;pbs_sched;hostname=fermi;pbs_leaf_name=N/A;pbs_mom_node_name=N/A
02/01/2021 15:40:15;0002;pbs_sched;Svr;pbs_sched;ipv4 interface lo: localhost 
02/01/2021 15:40:15;0002;pbs_sched;Svr;pbs_sched;ipv4 interface enp4s0: fermi 
02/01/2021 15:40:15;0002;pbs_sched;Svr;pbs_sched;ipv6 interface lo: ip6-loopback 
02/01/2021 15:40:15;0002;pbs_sched;Svr;pbs_sched;ipv6 interface enp4s0: fermi 
02/01/2021 15:40:15;0002;pbs_sched;n/a;setup_env;read environment from /var/spool/pbs/pbs_environment
02/01/2021 15:40:15;0006;pbs_sched;Fil;pbs_sched;Version 20.0.0, started, initialization type = 0
02/01/2021 15:40:15;0002;pbs_sched;Svr;sched_main;/opt/pbs/sbin/pbs_sched startup pid 476837
02/01/2021 15:40:15;0040;pbs_sched;Fil;fairshare usage;Creating usage database for fairshare
02/01/2021 15:40:15;0080;pbs_sched;Req;;Launching 16 worker threads
02/01/2021 15:40:19;0001;pbs_sched;Svr;pbs_sched;Access from host not allowed, or unknown host (15008) in connect_svrpool, Couldn't register the scheduler default with the configured servers
02/01/2021 15:40:21;0001;pbs_sched;Svr;pbs_sched;Access from host not allowed, or unknown host (15008) in connect_svrpool, Couldn't register the scheduler default with the configured servers
02/01/2021 15:40:23;0001;pbs_sched;Svr;pbs_sched;Access from host not allowed, or unknown host (15008) in connect_svrpool, Couldn't register the scheduler default with the configured servers
02/01/2021 15:40:25;0001;pbs_sched;Svr;pbs_sched;Access from host not allowed, or unknown host (15008) in connect_svrpool, Couldn't register the scheduler default with the configured servers
02/01/2021 15:40:27;0001;pbs_sched;Svr;pbs_sched;Access from host not allowed, or unknown host (15008) in connect_svrpool, Couldn't register the scheduler default with the configured servers    

comm_logs:

02/01/2021 15:40:15;0002;Comm@fermi;Svr;Log;Log opened
02/01/2021 15:40:15;0002;Comm@fermi;Svr;Comm@fermi;pbs_version=20.0.0
02/01/2021 15:40:15;0002;Comm@fermi;Svr;Comm@fermi;pbs_build=mach=N/A:security=N/A:configure_args=N/A
02/01/2021 15:40:15;0002;Comm@fermi;Svr;Comm@fermi;hostname=fermi;pbs_leaf_name=N/A;pbs_mom_node_name=N/A
02/01/2021 15:40:15;0002;Comm@fermi;Svr;Comm@fermi;ipv4 interface lo: localhost 
02/01/2021 15:40:15;0002;Comm@fermi;Svr;Comm@fermi;ipv4 interface enp4s0: fermi 
02/01/2021 15:40:15;0002;Comm@fermi;Svr;Comm@fermi;ipv6 interface lo: ip6-loopback 
02/01/2021 15:40:15;0002;Comm@fermi;Svr;Comm@fermi;ipv6 interface enp4s0: fermi 
02/01/2021 15:40:15;0002;Comm@fermi;Svr;Comm@fermi;/opt/pbs/sbin/pbs_comm ready (pid=476815), Proxy Name:fermi:17001, Threads:4
02/01/2021 15:40:15;0000;Comm@fermi;Svr;Comm@fermi;Supported authentication method: resvport
02/01/2021 15:40:15;0c06;Comm@fermi;TPP;Comm@fermi(Thread 1);Thread ready
02/01/2021 15:40:15;0c06;Comm@fermi;TPP;Comm@fermi(Thread 0);Thread ready
02/01/2021 15:40:15;0c06;Comm@fermi;TPP;Comm@fermi(Thread 2);Thread ready
02/01/2021 15:40:15;0c06;Comm@fermi;TPP;Comm@fermi(Thread 3);Thread ready
02/01/2021 15:40:15;0c06;Comm@fermi;TPP;Comm@fermi(Thread 1);tfd=14, Leaf registered address 192.168.0.100:15003
02/01/2021 15:40:18;0c06;Comm@fermi;TPP;Comm@fermi(Thread 2);tfd=16, Leaf registered address 192.168.0.100:15001
02/01/2021 15:46:14;0c06;Comm@fermi;TPP;Comm@fermi(Thread 2);tfd=16, Connection from leaf 192.168.0.100:15001 down
02/01/2021 15:46:14;0c06;Comm@fermi;TPP;Comm@fermi(Thread 1);tfd=14, Connection from leaf 192.168.0.100:15003 down
02/01/2021 15:46:14;0001;Comm@fermi;Svr;Comm@fermi;stop_me, Caught signal 15

My configuration:

badin@fermi:~$ qmgr -c "p s"
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch Priority = 50
set queue batch enabled = True
set queue batch started = True
#
# Create and define queue prior
#
create queue prior
set queue prior queue_type = Execution
set queue prior Priority = 1000
set queue prior enabled = True
set queue prior started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = localhost
set server acl_hosts += fermi
set server acl_users = badin@localhost
set server acl_users += badin@fermi
set server acl_users += root@localhost
set server acl_roots = badin@localhost
set server acl_roots += badin@fermi
set server acl_roots += root@localhost
set server managers = badin@fermi
set server managers += root@localhost
set server operators = badin@fermi
set server operators += root@localhost
set server default_queue = batch
set server log_events = 511
set server mailer = /usr/sbin/sendmail
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.ncpus = 1
set server resources_default.nodect = 1
set server resources_default.nodes = 1
set server default_chunk.ncpus = 1
set server scheduler_iteration = 600
set server node_pack = False
set server flatuid = True
set server resv_enable = True
set server node_fail_requeue = 310
set server max_array_size = 10000
set server pbs_license_min = 0
set server pbs_license_max = 2147483647
set server pbs_license_linger_time = 31536000
set server eligible_time_enable = False
set server max_concurrent_provision = 5
set server max_job_sequence_id = 9999999

/etc/pbs.conf:

PBS_EXEC=/opt/pbs
PBS_SERVER=fermi
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=1
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=16
PBS_SCP=/usr/bin/scp

/etc/hosts:

127.0.0.1       localhost
127.0.1.1       fermi
192.168.0.100   fermi

/etc/hosts_equiv:

+ badin

/var/spool/pbs/mom_priv/config:

$clienthost fermi
$restrict_user_maxsysid 999

Check for host resolvability:

badin@fermi:~$ pbs_hostn -v fermi
primary name: fermi (from gethostbyname())
aliases:            -none-
     address length:  4 bytes
     address:            127.0.1.1   (16842879 dec)  name:  fermi
     address:        192.168.0.100   (1677764800 dec)  name:  fermi
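
The output above shows fermi resolving to a loopback address (127.0.1.1) as well as to 192.168.0.100. A check for this kind of double mapping could be sketched as follows; it assumes the usual hosts-file layout of an address followed by one or more names, and `loopback_conflicts` is a hypothetical helper name, not a PBS tool.

```python
import ipaddress
from typing import List


def loopback_conflicts(hosts_text: str, hostname: str) -> List[str]:
    """Return the loopback addresses that hosts_text maps to hostname."""
    conflicts = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        if hostname not in names:
            continue
        try:
            if ipaddress.ip_address(address).is_loopback:
                conflicts.append(address)
        except ValueError:
            continue  # first field is not a valid IP address; skip the line
    return conflicts


# Contents of /etc/hosts as posted in this thread.
hosts = """\
127.0.0.1       localhost
127.0.1.1       fermi
192.168.0.100   fermi
"""
print(loopback_conflicts(hosts, "fermi"))
```

An empty list for the machine's own hostname would mean no loopback entry is shadowing the real address.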

nmaps:

badin@fermi:~$ nmap 127.0.0.1
Starting Nmap 7.80 ( https://nmap.org ) at 2021-02-01 16:05 CET
Nmap scan report for localhost (127.0.0.1)
Host is up (0.000035s latency).
Not shown: 992 closed ports
PORT      STATE SERVICE
22/tcp    open  ssh
25/tcp    open  smtp
111/tcp   open  rpcbind
587/tcp   open  submission
631/tcp   open  ipp
5432/tcp  open  postgresql
15002/tcp open  onep-tls
15003/tcp open  unknown

badin@fermi:~$ nmap 192.168.0.100
Starting Nmap 7.80 ( https://nmap.org ) at 2021-02-01 16:05 CET
Nmap scan report for fermi (192.168.0.100)
Host is up (0.000036s latency).
Not shown: 996 closed ports
PORT      STATE SERVICE
22/tcp    open  ssh
111/tcp   open  rpcbind
15002/tcp open  onep-tls
15003/tcp open  unknown

badin@fermi:~$ nmap fermi
Starting Nmap 7.80 ( https://nmap.org ) at 2021-02-01 16:06 CET
Nmap scan report for fermi (127.0.1.1)
Host is up (0.000036s latency).
Other addresses for fermi (not scanned): 192.168.0.100
Not shown: 996 closed ports
PORT      STATE SERVICE
22/tcp    open  ssh
111/tcp   open  rpcbind
15002/tcp open  onep-tls
15003/tcp open  unknown
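
None of the three scans above lists 15001 or 17001 as open, although the logs name 15001 as the server port and 17001 as the pbs_comm port. To probe just those ports without a full scan, something like the following could be used. It is a rough sketch using a plain TCP connect; the port numbers come from the log lines in this post, and `port_open` and `PBS_PORTS` are names invented for the example.

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Ports as reported in the server, MOM and comm logs of this thread.
PBS_PORTS = {"server": 15001, "mom": 15002, "rm": 15003, "comm": 17001}

if __name__ == "__main__":
    for name, port in PBS_PORTS.items():
        state = "open" if port_open("127.0.0.1", port) else "closed"
        print(f"{name:6s} {port:5d} {state}")
```

A port reported as closed here only means that nothing accepted a TCP connection on that address at that moment; it does not say why.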

Ubuntu firewalls:

badin@fermi:~$ sudo ufw status verbose
Status: active
Logging: on (low)
Default: deny (incoming), allow (outgoing), disabled (routed)
New profiles: skip

To                         Action      From
--                         ------      ----
22/tcp                     ALLOW IN    Anywhere                  
15002                      ALLOW IN    Anywhere                  
15003                      ALLOW IN    Anywhere                  
15001                      ALLOW IN    Anywhere                  
17001                      ALLOW IN    Anywhere                  
22/tcp (v6)                ALLOW IN    Anywhere (v6)             
15002 (v6)                 ALLOW IN    Anywhere (v6)             
15003 (v6)                 ALLOW IN    Anywhere (v6)             
15001 (v6)                 ALLOW IN    Anywhere (v6)             
17001 (v6)                 ALLOW IN    Anywhere (v6)   

pbsnodes -a

badin@fermi:~$ pbsnodes -a
fermi
     Mom = fermi
     ntype = PBS
     state = free
     pcpus = 32
     resources_available.arch = linux
     resources_available.host = fermi
     resources_available.mem = 65832120kb
     resources_available.ncpus = 16
     resources_available.vnode = fermi
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.vmem = 0kb
     resv_enable = True
     sharing = default_shared
     license = l
     last_state_change_time = Mon Feb  1 16:03:56 2021

The address 192.168.0.100 is assigned by the local router. I understand that the problem lies in

02/01/2021 15:46:14;0c06;Comm@fermi;TPP;Comm@fermi(Thread 2);tfd=16, Connection from leaf 192.168.0.100:15001 down
02/01/2021 15:46:14;0c06;Comm@fermi;TPP;Comm@fermi(Thread 1);tfd=14, Connection from leaf 192.168.0.100:15003 down

but I do not understand what is causing it or how to solve it. I would appreciate any help in resolving this issue.

  1. Please remove the loopback address (127.0.1.1) assigned to fermi in /etc/hosts.
  2. Is 192.168.0.100 a static IP address assigned to fermi? Please make it static and resolvable (forward and reverse), and check with: pbs_hostn -v fermi
  3. Also, please make sure ports 15001 to 15009 and 17001 are open for communication between all the pbs_ daemons.
  4. Disable the firewall and SELinux; reboot the system if you disable SELinux now.
  5. Set PBS_CORE_LIMIT to unlimited instead of 16 and restart the PBS services.

Hope this resolves the issue. Again, thanks for the diagnostic data above; it was very helpful.

Thank you very much for the suggestions.

Removed.

Yes, it is static and it is resolvable.

badin@fermi:~$ pbs_hostn -v fermi
primary name: fermi (from gethostbyname())
aliases:            -none-
     address length:  4 bytes
     address:        192.168.0.100   (1677764800 dec)  name:  fermi

Firewall disabled, SELinux was never enabled.

Set.

Thanks, now everything works :smile: I suspect that the error was actually caused by PBS_CORE_LIMIT: I misread it as the number of CPU threads rather than as the maximum allowed size of core files. Thank you very much. I think this was also the mistake that led to the behaviour described here: https://community.openpbs.org/t/mom-config-incomplete/411
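
For anyone who wants to check their own /etc/pbs.conf for the same setting, here is a small sketch. It assumes the file consists of simple KEY=VALUE lines like the one posted above; `read_pbs_conf` is a hypothetical helper, not something shipped with PBS, and the only values it knows about are the two mentioned in this thread (16 and unlimited).

```python
from typing import Dict


def read_pbs_conf(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, ignoring blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


# Excerpt of the pbs.conf posted earlier in this thread.
conf = read_pbs_conf("""\
PBS_SERVER=fermi
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=16
""")

limit = conf.get("PBS_CORE_LIMIT")
if limit is not None and limit != "unlimited":
    # The value is a core file size limit, not a CPU or thread count.
    print(f"PBS_CORE_LIMIT={limit}: core file size is capped; "
          "this thread was resolved by setting it to unlimited")
```

To check a real installation, pass the contents of /etc/pbs.conf to `read_pbs_conf` instead of the inline excerpt.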
