Run OpenPBS on a private network

Hello,
I have set up a small Linux cluster (one server with two execution nodes) on the university's network (IPs 155.X.X.X). This setup works fine. We also have a private network for our department with IPs 10.X.X.X, where communication is much faster. A user first connects through the university's network with ssh abc@vortex.edu; only once on vortex.edu can the user then connect to the faster network with ssh vortex.data.
Now I am trying to configure OpenPBS (23.06.06) so that it communicates over the .data network instead of the .edu network.
So far, I have changed the server name in /etc/pbs.conf to vortex.data on all machines, and when creating the nodes I entered the node names as wx1.data and wx2.data. But I get the error messages below:
server_logs:
04/18/2024 10:44:05;0c06;Server@vortex;TPP;Server@vortex(Thread 0);tpp_mbox_read;Unable to read from msg box
04/18/2024 10:44:05;0c06;Server@vortex;TPP;Server@vortex(Thread 0);tpp_mbox_read;Unable to read from msg box
04/18/2024 10:44:05;0001;Server@vortex;Svr;Server@vortex;is_request, bad attempt to connect from 155.X.X.X:15003

mom_logs:
04/18/2024 10:44:05;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
04/18/2024 10:44:05;0001;pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr 155.X.X.X:15001 on stream 110
04/18/2024 10:44:05;0002;pbs_mom;Svr;im_eof;Server closed connection.

Any suggestions in this regard would be greatly appreciated. Which settings should I check and change?
(Just to mention again: OpenPBS works fine on the cluster through the 155.X.X.X network. I get the messages above when the nodes are named wx1.data and wx2.data in pbsnodes and the server name in /etc/pbs.conf is vortex.data. I am not sure why the error messages still show a bad attempt to connect from 155.X.X.X.)

Maybe dive into PBS_LEAF_NAME or something along those lines?

Thanks for the suggestion.
I added PBS_LEAF_NAME to /etc/pbs.conf on all machines, i.e., PBS_LEAF_NAME=vortex.data on the server host, and PBS_LEAF_NAME set to wx1.data and wx2.data respectively on the two execution nodes.
Also, in /etc/hosts I added the IPs and names, i.e., 10.X.X.X vortex.data, and likewise on the two execution nodes.
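For reference, the /etc/hosts entries on each machine look roughly like this (IPs redacted here the same way as above):

10.X.X.X   vortex.data
10.X.X.X   wx1.data
10.X.X.X   wx2.data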
However, I now get "qsub: Bad UID for job execution" when I submit a job.
Are there any other settings that need to be changed?

Fixed the Bad UID error by setting flatuid on the server via qmgr: set server flatuid = True.
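In command form, that is:

qmgr -c "set server flatuid = True"

(flatuid tells the server to treat usernames as equivalent across all hosts, so it should only be used when the same accounts exist everywhere in the cluster.)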
But now the jobs stay in Q with the message: Insufficient amount of resource: ncpus (R: 2 A: 0 T: 0)

I deleted nodes wx1 and wx2 and then added wx1.data and wx2.data, but it didn't seem to make any difference. Below is the output of pbsnodes -av:

wxml1.data
Mom = wxml1.data
ntype = PBS
state = state-unknown,down
resources_available.host = wxml1
resources_available.vnode = wxml1.data
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

wxml2.data
Mom = wxml2.data
ntype = PBS
state = state-unknown,down
resources_available.host = wxml2
resources_available.vnode = wxml2.data
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

Any suggestions would be much appreciated.

Also, the comm_logs are as below:
04/19/2024 08:27:32;0c06;Comm@vortex;TPP;Comm@vortex(Thread 2);tfd=16, Connection from leaf 10.1.0.48:15003 down
04/19/2024 08:27:37;0c06;Comm@vortex;TPP;Comm@vortex(Thread 1);tfd=14, Connection from leaf 10.1.0.58:15003 down

SELinux is disabled on these machines, and I am able to connect to them via passwordless ssh.

Where does the 'm' in the hostname come from? If it is not a typo, maybe check /etc/hosts or DNS?
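A quick way to see how each name resolves, on the server and on the nodes (just a suggestion, assuming a standard Linux setup):

getent hosts vortex.data wxml1.data wxml2.data
ping -c 1 wxml1.data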

Sorry, the names of the execution nodes are wxml1 and wxml2. Or were you referring to some other typo?

I meant exactly the 'm' in the node names. The documentation says we don't need to set up a leaf name for pbs_comm, yet the errors indicate it might be the problem.
Try PBS_LEAF_ROUTERS from doc section 4.4.1 as well:

PBS_LEAF_ROUTERS=<host>[:<port>][,<host>[:<port>]]
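For example, pointing the MoMs at the comm running on the server host over the private interface would look something like this (17001 is the default pbs_comm port, so the :port part can usually be omitted):

PBS_LEAF_ROUTERS=vortex.data:17001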

If I understood correctly, does that mean adding PBS_LEAF_ROUTERS on each execution node so that it points to vortex.data, i.e., on both wxml1 and wxml2 adding this line to /etc/pbs.conf:
PBS_LEAF_ROUTERS=vortex.data

(Sorry if it's a silly question.)

Yes, set it on each execution node. I would suggest adding all the interfaces to PBS_LEAF_ROUTERS, for example:
PBS_LEAF_ROUTERS=vortex.data,vortex.edu

Thanks for clarifying. I made the changes, but it's still not working. Perhaps the details below will help.

/etc/pbs.conf (on the server host)
PBS_EXEC=/opt/pbs
PBS_SERVER=vortex
PBS_LEAF_NAME=vortex.data
#PBS_LEAF_ROUTERS=vortex.data
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=0
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp

/etc/pbs.conf (for wxml1)
PBS_EXEC=/opt/pbs
PBS_SERVER=vortex
PBS_LEAF_NAME=wxml1.data
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_START_COMM=0
PBS_START_MOM=1
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp
PBS_LEAF_ROUTERS=vortex.data,vortex.dbwks.erau.edu

/etc/pbs.conf (for wxml2)
PBS_EXEC=/opt/pbs
PBS_SERVER=vortex
PBS_LEAF_NAME=wxml2.data
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_START_COMM=0
PBS_START_MOM=1
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp
PBS_LEAF_ROUTERS=vortex.data,vortex.dbwks.erau.edu
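For completeness, pbs.conf changes only take effect after the PBS services are restarted; on a systemd-based install that is roughly the following, run on the server host and on each execution node:

systemctl restart pbs    (or: /etc/init.d/pbs restart)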

sched_logs:
04/20/2024 10:39:45;0080;pbs_sched;Req;;Starting Scheduling Cycle
04/20/2024 10:39:45;0080;pbs_sched;Job;38.vortex;Considering job to run
04/20/2024 10:39:45;0040;pbs_sched;Job;38.vortex;Insufficient amount of resource: ncpus (R: 2 A: 0 T: 0)
04/20/2024 10:39:45;0040;pbs_sched;Job;38.vortex;Job will never run with the resources currently configured in the complex
04/20/2024 10:39:45;0080;pbs_sched;Req;;Leaving Scheduling Cycle

comm_logs:
04/20/2024 10:50:14;0c06;Comm@vortex;TPP;Comm@vortex(Thread 3);tfd=18, Leaf registered address 10.1.0.48:15003
04/20/2024 10:50:14;0c06;Comm@vortex;TPP;Comm@vortex(Thread 3);tfd=18, Leaf 10.1.0.48:15003 still connected while another leaf connect arrived, dropping existing connection 16
04/20/2024 10:50:14;0c06;Comm@vortex;TPP;Comm@vortex(Thread 2);tfd=16, Connection from leaf 10.1.0.48:15003 down

server_logs:
04/20/2024 10:39:45;0100;Server@vortex;Req;;Type 0 request received from bawag@vortex.dbwks.erau.edu, sock=18
04/20/2024 10:39:45;0100;Server@vortex;Req;;Type 95 request received from bawag@vortex.dbwks.erau.edu, sock=19
04/20/2024 10:39:45;0100;Server@vortex;Req;;Type 21 request received from bawag@vortex.dbwks.erau.edu, sock=18
04/20/2024 10:39:45;0100;Server@vortex;Req;;Type 1 request received from bawag@vortex.dbwks.erau.edu, sock=18
04/20/2024 10:39:45;0100;Server@vortex;Req;;Type 3 request received from bawag@vortex.dbwks.erau.edu, sock=18
04/20/2024 10:39:45;0100;Server@vortex;Req;;Type 5 request received from bawag@vortex.dbwks.erau.edu, sock=18
04/20/2024 10:39:45;0100;Server@vortex;Job;38.vortex;enqueuing into gfs, state Q hop 1
04/20/2024 10:39:45;0008;Server@vortex;Job;38.vortex;Job Queued at request of bawag@vortex.dbwks.erau.edu, owner = bawag@vortex.dbwks.erau.edu, job name = pbsdshtest, queue = gfs
04/20/2024 10:39:45;0040;Server@vortex;Svr;vortex;Scheduler sent command 1
04/20/2024 10:39:45;0100;Server@vortex;Req;;Type 21 request received from Scheduler@vortex.dbwks.erau.edu, sock=16
04/20/2024 10:39:45;0100;Server@vortex;Req;;Type 71 request received from Scheduler@vortex.dbwks.erau.edu, sock=16
04/20/2024 10:39:45;0100;Server@vortex;Req;;Type 58 request received from Scheduler@vortex.dbwks.erau.edu, sock=16
04/20/2024 10:39:45;0100;Server@vortex;Req;;Type 20 request received from Scheduler@vortex.dbwks.erau.edu, sock=16
04/20/2024 10:39:45;0100;Server@vortex;Req;;Type 51 request received from Scheduler@vortex.dbwks.erau.edu, sock=16
04/20/2024 10:39:45;0100;Server@vortex;Req;;Type 96 request received from Scheduler@vortex.dbwks.erau.edu, sock=16
04/20/2024 10:39:45;0100;Server@vortex;Hook;;modifyjob event: accept req by default
04/20/2024 10:39:45;0008;Server@vortex;Job;38.vortex;Job Modified at request of Scheduler@vortex.dbwks.erau.edu

mom_logs (on wxml1):
04/20/2024 10:52:08;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 10.1.0.48:15003 to pbs_comm vortex.dbwks.erau.edu:17001
04/20/2024 10:52:08;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm vortex.dbwks.erau.edu:17001
04/20/2024 10:52:08;0001;pbs_mom;Svr;net_restore_handler;net restore handler called
04/20/2024 10:52:08;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connection to pbs_comm vortex.dbwks.erau.edu:17001 down
04/20/2024 10:52:08;0001;pbs_mom;Svr;net_down_handler;net down handler called
04/20/2024 10:52:08;0001;pbs_mom;Svr;net_restore_handler;net restore handler called
04/20/2024 10:52:08;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connection to pbs_comm vortex.data:17001 down

mom_logs (on wxml2):
04/20/2024 10:53:06;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 10.1.0.58:15003 to pbs_comm vortex.dbwks.erau.edu:17001
04/20/2024 10:53:06;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm vortex.dbwks.erau.edu:17001
04/20/2024 10:53:06;0001;pbs_mom;Svr;net_restore_handler;net restore handler called
04/20/2024 10:53:06;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connection to pbs_comm vortex.dbwks.erau.edu:17001 down
04/20/2024 10:53:06;0001;pbs_mom;Svr;net_down_handler;net down handler called
04/20/2024 10:53:06;0001;pbs_mom;Svr;net_restore_handler;net restore handler called

Am I doing something wrong?

Also, below is the output from pbsnodes -av

wxml1
Mom = wxml1.db.erau.edu
ntype = PBS
state = state-unknown,down
pcpus = 1
resources_available.host = wxml1
resources_available.vnode = wxml1
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

wxml2
Mom = wxml2.db.erau.edu
ntype = PBS
state = state-unknown,down
pcpus = 1
resources_available.host = wxml2
resources_available.vnode = wxml2
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

I think the issue is with pbs_comm, as per the errors below:

04/20/2024 12:43:47;0c06;Comm@vortex;TPP;Comm@vortex(Thread 2);tfd=18, Leaf registered address 10.1.0.58:15003
04/20/2024 12:43:47;0c06;Comm@vortex;TPP;Comm@vortex(Thread 2);tfd=18, Leaf 10.1.0.58:15003 still connected while another leaf connect arrived, dropping existing connection 16
04/20/2024 12:43:47;0c06;Comm@vortex;TPP;Comm@vortex(Thread 1);tfd=16, Connection from leaf 10.1.0.58:15003 down

04/20/2024 12:43:47;0c06;Comm@vortex;TPP;Comm@vortex(Thread 1);tfd=18, Leaf registered address 10.1.0.48:15003
04/20/2024 12:43:47;0c06;Comm@vortex;TPP;Comm@vortex(Thread 1);tfd=18, Leaf 10.1.0.48:15003 still connected while another leaf connect arrived, dropping existing connection 16
04/20/2024 12:43:47;0c06;Comm@vortex;TPP;Comm@vortex(Thread 3);tfd=16, Connection from leaf 10.1.0.48:15003 down

Any suggestions would be highly appreciated.

Fixed! These are the steps that resolved it:

Created the nodes using their leaf names, i.e., wxml1.data and wxml2.data.
Added PBS_LEAF_ROUTERS=vortex.data to /etc/pbs.conf on each MoM host.
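For anyone landing here later, a minimal sketch of the working setup based on this thread (hostnames as used above; adjust to your own environment):

# on the server host, recreate the nodes under their leaf names
qmgr -c "delete node wxml1"
qmgr -c "delete node wxml2"
qmgr -c "create node wxml1.data"
qmgr -c "create node wxml2.data"

# relevant lines in /etc/pbs.conf on each execution node
PBS_SERVER=vortex
PBS_LEAF_NAME=wxml1.data        (wxml2.data on the second node)
PBS_LEAF_ROUTERS=vortex.data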
