I am running a cluster called Archer. It's running fine. I have a remote machine called Topgun. I installed the server software on Topgun and pointed it to Archer as the server. I can do a “qstat” and see jobs running on Archer. I would like to submit jobs on Topgun to be run on Archer, but the job just sits there.
Please share the output of the commands below:
**Archer**:
1. qstat -Bf
2. pbsnodes -aSjv
3. cat /etc/pbs.conf
4. qstat -answ1
**Topgun**
1. qstat -Bf
2. pbsnodes -av
3. cat /etc/pbs.conf
4. qstat -answ1
It is hard to provide that info since I am in a classified environment and the command output includes fully qualified domain names.
If I try to submit an interactive job on Topgun, I get this:
-bash-4.2$ qsub -I
qsub: waiting for job 194734.archer.xxx.xxx to start
qstat on both machines sees the jobs running on Archer. pbsnodes -a shows only the nodes on Archer, which is what I want: I don’t want jobs to run on Topgun, I only want to run jobs on the Archer cluster.
08/02/2022 17:51:53;0080;Server@archer;Req;req_reject;Reject reply code=15139, aux=0, type=8, from ramos@topgun.xxx.xxx.xx
Archer sees the request; it just rejects it.
PBS_SERVER=archer.xx.xx.xx
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=0
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp
pbs.conf is identical on both machines.
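Since this pbs.conf starts the server, scheduler, and comm daemons wherever it is used, a quick way to confirm which PBS daemons are actually running on each host is something like the sketch below (the `pbs` service name is the usual one for PBS Professional, but may differ on a given install):

```
# List any running PBS daemons on this host
ps -ef | grep '[p]bs_'      # expect pbs_server/pbs_sched/pbs_comm on Archer only

# Check the PBS service status (systemd installs), or use the init script
systemctl status pbs        # older installs: /etc/init.d/pbs status
```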
I am finally able to submit a job after setting up the ssh keys, but interactive jobs fail.
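With PBS_SCP=/bin/scp, job file staging and output return go over scp as the submitting user, so passwordless ssh between the submission host and the execution hosts is typically what makes this work; a minimal sketch, assuming the keys are set up from Topgun and the hostname is a placeholder:

```
# Run as the submitting user on Topgun (hostname is a placeholder)
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa    # skip if a key already exists
ssh-copy-id archer.xxx.xxx                  # repeat for other hosts as needed
ssh archer.xxx.xxx hostname                 # should return without a password prompt
```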
Thank you for all the above details and tests.
- The interactive job might be failing due to firewalls or blocked ports (see the quick port check sketched after the references below)
Please refer:
Interactive Job errors out with 'apparently deleted' - #13 by scc
Interactive Job errors out with 'apparently deleted' - #16 by adarsh
Please refer: Where to submit a job instead of pbs server - #10 by scc
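For an interactive job, the execution host's MoM has to connect back to the port that qsub is listening on at the submission host, so the firewall on Topgun must allow those inbound connections in addition to the standard PBS ports being open towards Archer; a rough check could look like this (ports are the PBS Pro defaults, hostnames are placeholders):

```
# From Topgun: confirm the standard server/comm ports on Archer are reachable
nc -zv archer.xxx.xxx 15001     # pbs_server
nc -zv archer.xxx.xxx 17001     # pbs_comm

# While `qsub -I` is waiting, find the port qsub is listening on at Topgun;
# the execution host must be able to reach Topgun on that port
ss -tlnp | grep qsub
```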
Thank you
Then the /etc/pbs.conf file on Topgun should be as follows (since Topgun is only used to submit jobs to Archer, it acts like a client or login node: it does not run any of the PBS services/daemons and only has the command-line tools for job management):
PBS_SERVER=archer.xx.xx.xx
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_START_COMM=0
PBS_START_MOM=0
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp
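After putting this client-only pbs.conf in place, a quick way to verify the change (a sketch; the service name and test job are assumptions) is to stop any PBS daemons still running on Topgun and confirm the commands still reach Archer:

```
# On Topgun: stop any locally running PBS daemons (a client-only host needs none)
systemctl stop pbs            # or: /etc/init.d/pbs stop on older installs

# Verify the client commands still talk to the Archer server
qstat -Bf
qsub -- /bin/hostname         # trivial test job; it should run on an Archer node
```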
Please see the PBS Professional Installation and Upgrade Guide, especially chapter 1, “PBS Architecture”, chapter 2, “Pre-installation Steps”, and chapter 3, “Installation”.
I have it working, to a degree. I have another cluster called Maury. It is a conventional cluster with a head node and two login nodes, Maury1 and Maury. We have another machine that doesn’t show up with “pbsnodes -a” but can submit jobs. That machine can also “qsub -I” and land on one of the compute nodes. The user is satisfied now that he can submit jobs, but he can’t run an interactive job.
Please note that
pbsnodes -a
shows only the compute node(s) that have been added to the PBS server with the command
qmgr -c "create node node-hostname"
If you have login nodes (where none of the PBS services are running, and only the commands are deployed), these are not part of the compute resources and will not show up in the pbsnodes -a output.
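To illustrate the difference (hostnames below are hypothetical): a compute node has to be created in qmgr before it appears in pbsnodes -a, while a submission-only login host never appears there but can still submit batch and interactive jobs, provided the server accepts requests from it:

```
# On the PBS server: register a compute node; only registered nodes show in pbsnodes -a
qmgr -c "create node compute01"
pbsnodes compute01            # shows the vnode's state and resources

# A submission-only host is not registered, so querying it as a node fails...
pbsnodes login01              # expect an "unknown node" style error
# ...yet jobs can still be submitted from it if the server allows the host/user
qsub -I                       # run from login01; should land on a compute node
```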
I understand all of that; I set all of that up and installed PBS on it. I know I installed PBS on the machine that can submit jobs remotely, because the PBS software is in my home directory. It was originally based on PBS 13-something. I upgraded that cluster a couple of years ago. It is a paid-for version with support. I just don’t understand why I can run interactive jobs on that cluster from a machine that doesn’t show up in “pbsnodes -a”.