Run OpenPBS on a private network

Hello,
I have set up a small Linux cluster (one server with two execution nodes) on the university's network (IPs 155.X.X.X). This setup works fine. We also have a private network for our department with IPs 10.X.X.X, where communication is much faster. A user first connects through the university's network with ssh abc@vortex.edu; only once on vortex.edu can the user then connect to the faster network with ssh vortex.data.
Now I am trying to configure OpenPBS (23.06.06) so that it communicates over the .data network instead of the .edu network.
So far, I have changed the server name in /etc/pbs.conf to vortex.data on all machines, and when creating the nodes I entered the node names as wx1.data and wx2.data. But I get the error messages below:
server_logs:
04/18/2024 10:44:05;0c06;Server@vortex;TPP;Server@vortex(Thread 0);tpp_mbox_read;Unable to read from msg box
04/18/2024 10:44:05;0c06;Server@vortex;TPP;Server@vortex(Thread 0);tpp_mbox_read;Unable to read from msg box
04/18/2024 10:44:05;0001;Server@vortex;Svr;Server@vortex;is_request, bad attempt to connect from 155.X.X.X:15003

mom_logs:
04/18/2024 10:44:05;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
04/18/2024 10:44:05;0001;pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr 155.X.X.X:15001 on stream 110
04/18/2024 10:44:05;0002;pbs_mom;Svr;im_eof;Server closed connection.

Any suggestions in this regard would be greatly appreciated. Which settings should I check and change?
(Just to mention again: OpenPBS works fine on the cluster through the 155.X.X.X network. I get the messages above when the nodes are named wx1.data and wx2.data in pbsnodes and the server name in /etc/pbs.conf is vortex.data. I am not sure why the error messages still show a bad attempt to connect from 155.X.X.X.)

Maybe dive into PBS_LEAF_NAME or something along those lines?

Thanks for the suggestion.
I added PBS_LEAF_NAME to /etc/pbs.conf on all machines, i.e., PBS_LEAF_NAME=vortex.data on the server host, and PBS_LEAF_NAME set to wx1.data and wx2.data respectively on the two execution nodes.
Also, in /etc/hosts I added the IPs and names, i.e., 10.X.X.X vortex.data, and likewise on the two execution nodes.
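For reference, the /etc/hosts entries on each machine look roughly like this (IPs redacted here the same way as above):

10.X.X.X   vortex.data
10.X.X.X   wx1.data
10.X.X.X   wx2.data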
However, I now get "qsub: Bad UID for job execution" when I submit a job.
Are there any other settings that need to be changed?

Fixed the Bad UID error by setting flatuid on the server via qmgr: set server flatuid = True.
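In command form, that is:

qmgr -c "set server flatuid = True"

(flatuid tells the server to treat usernames as equivalent across all hosts, so it should only be used when the same accounts exist everywhere in the cluster.)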
But now the jobs stay in Q with the message: Insufficient amount of resource: ncpus (R: 2 A: 0 T: 0)

I deleted nodes wx1 and wx2 and then added wx1.data and wx2.data, but it didn't seem to make any difference. Below is the output of pbsnodes -av:

wxml1.data
Mom = wxml1.data
ntype = PBS
state = state-unknown,down
resources_available.host = wxml1
resources_available.vnode = wxml1.data
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

wxml2.data
Mom = wxml2.data
ntype = PBS
state = state-unknown,down
resources_available.host = wxml2
resources_available.vnode = wxml2.data
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

Any suggestions would be much appreciated.

Also, the comm_logs are as below:
04/19/2024 08:27:32;0c06;Comm@vortex;TPP;Comm@vortex(Thread 2);tfd=16, Connection from leaf 10.1.0.48:15003 down
04/19/2024 08:27:37;0c06;Comm@vortex;TPP;Comm@vortex(Thread 1);tfd=14, Connection from leaf 10.1.0.58:15003 down

SELinux is disabled on these machines, and I am able to connect to them via passwordless ssh.

Where does the 'm' in the hostname come from? If it is not a typo, maybe check /etc/hosts or DNS?
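A quick way to see how each name resolves, on the server and on the nodes (just a suggestion, assuming a standard Linux setup):

getent hosts vortex.data wxml1.data wxml2.data
ping -c 1 wxml1.data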

Sorry, the names of the execution nodes are wxml1 and wxml2. Or were you referring to some other typo?

I meant exactly the 'm' in the node names. The documentation says we don't need to set up a leaf name for pbs_comm, yet the errors indicate it might be the problem.
Try PBS_LEAF_ROUTERS from doc section 4.4.1 as well:

PBS_LEAF_ROUTERS=<host>[:<port>][,<host>[:<port>]]
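For example, pointing the MoMs at the comm running on the server host over the private interface would look something like this (17001 is the default pbs_comm port, so the :port part can usually be omitted):

PBS_LEAF_ROUTERS=vortex.data:17001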

If I understood correctly, does that mean adding PBS_LEAF_ROUTERS on each execution node so that it points to vortex.data, i.e., on both wxml1 and wxml2 adding this line to /etc/pbs.conf:
PBS_LEAF_ROUTERS=vortex.data

(Sorry if it's a silly question.)

Yes, set it on each execution node. I would suggest adding all the interfaces to PBS_LEAF_ROUTERS, for example:
PBS_LEAF_ROUTERS=vortex.data,vortex.edu

Thanks for clarifying. I made the changes, but it's still not working. Perhaps the details below will help.

/etc/pbs.conf (on the server host)
PBS_EXEC=/opt/pbs
PBS_SERVER=vortex
PBS_LEAF_NAME=vortex.data
#PBS_LEAF_ROUTERS=vortex.data
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=0
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp

/etc/pbs.conf (for wxml1)
PBS_EXEC=/opt/pbs
PBS_SERVER=vortex
PBS_LEAF_NAME=wxml1.data
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_START_COMM=0
PBS_START_MOM=1
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp
PBS_LEAF_ROUTERS=vortex.data,vortex.dbwks.erau.edu

/etc/pbs.conf (for wxml2)
PBS_EXEC=/opt/pbs
PBS_SERVER=vortex
PBS_LEAF_NAME=wxml2.data
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_START_COMM=0
PBS_START_MOM=1
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp
PBS_LEAF_ROUTERS=vortex.data,vortex.dbwks.erau.edu
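For completeness, pbs.conf changes only take effect after the PBS services are restarted; on a systemd-based install that is roughly the following, run on the server host and on each execution node:

systemctl restart pbs    (or: /etc/init.d/pbs restart)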

sched_logs:
04/20/2024 10:39:45;0080;pbs_sched;Req;;Starting Scheduling Cycle
04/20/2024 10:39:45;0080;pbs_sched;Job;38.vortex;Considering job to run
04/20/2024 10:39:45;0040;pbs_sched;Job;38.vortex;Insufficient amount of resource: ncpus (R: 2 A: 0 T: 0)
04/20/2024 10:39:45;0040;pbs_sched;Job;38.vortex;Job will never run with the resources currently configured in the complex
04/20/2024 10:39:45;0080;pbs_sched;Req;;Leaving Scheduling Cycle

comm_logs:
04/20/2024 10:50:14;0c06;Comm@vortex;TPP;Comm@vortex(Thread 3);tfd=18, Leaf registered address 10.1.0.48:15003
04/20/2024 10:50:14;0c06;Comm@vortex;TPP;Comm@vortex(Thread 3);tfd=18, Leaf 10.1.0.48:15003 still connected while another leaf connect arrived, dropping existing connection 16
04/20/2024 10:50:14;0c06;Comm@vortex;TPP;Comm@vortex(Thread 2);tfd=16, Connection from leaf 10.1.0.48:15003 down

server_logs:
04/20/2024 10:39:45;0100;Server@vortex;Req;;Type 0 request received from bawag@vortex.dbwks.erau.edu, sock=18
04/20/2024 10:39:45;0100;Server@vortex;Req;;Type 95 request received from bawag@vortex.dbwks.erau.edu, sock=19
04/20/2024 10:39:45;0100;Server@vortex;Req;;Type 21 request received from bawag@vortex.dbwks.erau.edu, sock=18
04/20/2024 10:39:45;0100;Server@vortex;Req;;Type 1 request received from bawag@vortex.dbwks.erau.edu, sock=18
04/20/2024 10:39:45;0100;Server@vortex;Req;;Type 3 request received from bawag@vortex.dbwks.erau.edu, sock=18
04/20/2024 10:39:45;0100;Server@vortex;Req;;Type 5 request received from bawag@vortex.dbwks.erau.edu, sock=18
04/20/2024 10:39:45;0100;Server@vortex;Job;38.vortex;enqueuing into gfs, state Q hop 1
04/20/2024 10:39:45;0008;Server@vortex;Job;38.vortex;Job Queued at request of bawag@vortex.dbwks.erau.edu, owner = bawag@vortex.dbwks.erau.edu, job name = pbsdshtest, queue = gfs
04/20/2024 10:39:45;0040;Server@vortex;Svr;vortex;Scheduler sent command 1
04/20/2024 10:39:45;0100;Server@vortex;Req;;Type 21 request received from Scheduler@vortex.dbwks.erau.edu, sock=16
04/20/2024 10:39:45;0100;Server@vortex;Req;;Type 71 request received from Scheduler@vortex.dbwks.erau.edu, sock=16
04/20/2024 10:39:45;0100;Server@vortex;Req;;Type 58 request received from Scheduler@vortex.dbwks.erau.edu, sock=16
04/20/2024 10:39:45;0100;Server@vortex;Req;;Type 20 request received from Scheduler@vortex.dbwks.erau.edu, sock=16
04/20/2024 10:39:45;0100;Server@vortex;Req;;Type 51 request received from Scheduler@vortex.dbwks.erau.edu, sock=16
04/20/2024 10:39:45;0100;Server@vortex;Req;;Type 96 request received from Scheduler@vortex.dbwks.erau.edu, sock=16
04/20/2024 10:39:45;0100;Server@vortex;Hook;;modifyjob event: accept req by default
04/20/2024 10:39:45;0008;Server@vortex;Job;38.vortex;Job Modified at request of Scheduler@vortex.dbwks.erau.edu

mom_logs (on wxml1):
04/20/2024 10:52:08;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 10.1.0.48:15003 to pbs_comm vortex.dbwks.erau.edu:17001
04/20/2024 10:52:08;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm vortex.dbwks.erau.edu:17001
04/20/2024 10:52:08;0001;pbs_mom;Svr;net_restore_handler;net restore handler called
04/20/2024 10:52:08;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connection to pbs_comm vortex.dbwks.erau.edu:17001 down
04/20/2024 10:52:08;0001;pbs_mom;Svr;net_down_handler;net down handler called
04/20/2024 10:52:08;0001;pbs_mom;Svr;net_restore_handler;net restore handler called
04/20/2024 10:52:08;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connection to pbs_comm vortex.data:17001 down

mom_logs (on wxml2):
04/20/2024 10:53:06;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 10.1.0.58:15003 to pbs_comm vortex.dbwks.erau.edu:17001
04/20/2024 10:53:06;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm vortex.dbwks.erau.edu:17001
04/20/2024 10:53:06;0001;pbs_mom;Svr;net_restore_handler;net restore handler called
04/20/2024 10:53:06;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connection to pbs_comm vortex.dbwks.erau.edu:17001 down
04/20/2024 10:53:06;0001;pbs_mom;Svr;net_down_handler;net down handler called
04/20/2024 10:53:06;0001;pbs_mom;Svr;net_restore_handler;net restore handler called

Am I doing something wrong?

Also, below is the output from pbsnodes -av

wxml1
Mom = wxml1.db.erau.edu
ntype = PBS
state = state-unknown,down
pcpus = 1
resources_available.host = wxml1
resources_available.vnode = wxml1
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

wxml2
Mom = wxml2.db.erau.edu
ntype = PBS
state = state-unknown,down
pcpus = 1
resources_available.host = wxml2
resources_available.vnode = wxml2
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

I think the issue is with pbs_comm, as per the errors below:

04/20/2024 12:43:47;0c06;Comm@vortex;TPP;Comm@vortex(Thread 2);tfd=18, Leaf registered address 10.1.0.58:15003
04/20/2024 12:43:47;0c06;Comm@vortex;TPP;Comm@vortex(Thread 2);tfd=18, Leaf 10.1.0.58:15003 still connected while another leaf connect arrived, dropping existing connection 16
04/20/2024 12:43:47;0c06;Comm@vortex;TPP;Comm@vortex(Thread 1);tfd=16, Connection from leaf 10.1.0.58:15003 down

04/20/2024 12:43:47;0c06;Comm@vortex;TPP;Comm@vortex(Thread 1);tfd=18, Leaf registered address 10.1.0.48:15003
04/20/2024 12:43:47;0c06;Comm@vortex;TPP;Comm@vortex(Thread 1);tfd=18, Leaf 10.1.0.48:15003 still connected while another leaf connect arrived, dropping existing connection 16
04/20/2024 12:43:47;0c06;Comm@vortex;TPP;Comm@vortex(Thread 3);tfd=16, Connection from leaf 10.1.0.48:15003 down

Any suggestions would be highly appreciated.

Fixed! These are the steps that resolved it:

Created the nodes using their leaf names, i.e., wxml1.data and wxml2.data.
Added PBS_LEAF_ROUTERS=vortex.data to /etc/pbs.conf on each MoM host.
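For anyone landing here later, a minimal sketch of the working setup based on this thread (hostnames as used above; adjust to your own environment):

# on the server host, recreate the nodes under their leaf names
qmgr -c "delete node wxml1"
qmgr -c "delete node wxml2"
qmgr -c "create node wxml1.data"
qmgr -c "create node wxml2.data"

# relevant lines in /etc/pbs.conf on each execution node
PBS_SERVER=vortex
PBS_LEAF_NAME=wxml1.data        (wxml2.data on the second node)
PBS_LEAF_ROUTERS=vortex.data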
