No Password Entry for User error

Hi All,
I have been trying to install PBS and get it running on my small cluster, which consists of two slave nodes and one master node. The master node is also set up as a third execution node.
Although I can connect from the slave nodes to the master node and vice versa through ssh without entering a password, my jobs are not submitted to the slave nodes and run only on the master node instead. I have removed PBS from the nodes and reinstalled everything from scratch, but no luck so far. Below is the output of the log file on one of the slave nodes. It looks like a password-related issue but, as I mentioned above, I can even log in to the nodes as root from any node through ssh without being asked for a password. I also opened the ports as directed in the installation instructions and disabled SELinux. I am not sure what I am missing. I would appreciate any help in solving the issue.

Thanks in advance,

03/06/2022 21:00:12;0100;pbs_mom;Req;;Type 1 request received from root@192.168.1.1:15001, sock=1
03/06/2022 21:00:12;0100;pbs_mom;Req;;Type 5 request received from root@192.168.1.1:15001, sock=1
03/06/2022 21:00:12;0028;pbs_mom;Job;2067.hep-node0;No Password Entry for User ali_0
03/06/2022 21:00:12;0008;pbs_mom;Job;2067.hep-node0;kill_job
03/06/2022 21:00:12;0100;pbs_mom;Job;2067.hep-node0;hep-node2 cput=00:00:00 mem=0kb
03/06/2022 21:00:12;0100;pbs_mom;Job;2067.hep-node0;Obit sent
03/06/2022 21:00:12;0100;pbs_mom;Req;;Type 6 request received from root@192.168.1.1:15001, sock=1
03/06/2022 21:00:12;0080;pbs_mom;Job;2067.hep-node0;delete job request received
03/06/2022 21:00:12;0008;pbs_mom;Job;2067.hep-node0;kill_job
03/06/2022 21:00:12;0100;pbs_mom;Req;;Type 1 request received from root@192.168.1.1:15001, sock=1
03/06/2022 21:00:12;0100;pbs_mom;Req;;Type 5 request received from root@192.168.1.1:15001, sock=1
03/06/2022 21:00:12;0028;pbs_mom;Job;2067.hep-node0;No Password Entry for User ali_0
03/06/2022 21:00:12;0008;pbs_mom;Job;2067.hep-node0;kill_job
03/06/2022 21:00:12;0100;pbs_mom;Job;2067.hep-node0;hep-node2

Does anyone have any suggestions?

It means ali_0 has no entry in /etc/passwd, i.e. the user does not exist on that system (here, the compute node that the job is scheduled to run on).

Please check that /etc/passwd exists on all the nodes and, if it does, whether every user has an entry in that file.
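A quick way to run that check on a node is sketched below. It is only an illustration: the helper name has_passwd_entry is made up for this example, and it relies on `id`, which looks the user up in the system's user database (normally /etc/passwd on a small cluster like this one).

```shell
#!/bin/sh
# Hypothetical helper: report whether a user account is known on THIS node.
# `id -u` fails when the system's user database has no entry for the name.
has_passwd_entry() {
    if id -u "$1" >/dev/null 2>&1; then
        echo "present"
    else
        echo "missing"
    fi
}

# ali_0 is the job owner named in the pbs_mom log above.
echo "ali_0 on $(uname -n): $(has_passwd_entry ali_0)"
```

Running it on each node in turn (for example over ssh) should show which nodes lack the account.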

Thanks again for your reply, @adarsh. I see that the master node (ali_0) and the slave nodes (ali_2, ali_5) already have an /etc/passwd file. Each passwd file contains an entry corresponding to its own name only. For example, the passwd file on ali_0 has an entry for ali_0 only, the passwd file on ali_2 has an entry for ali_2 only, and so on. So do you mean that the passwd file on every node has to have an entry for all users? That is, the passwd file on ali_0 has to have entries for ali_0, ali_2, and ali_5?
Or do I need to create a user with the same username on the master node and all the slave nodes, so that /etc/passwd ends up with the same single user entry on every node?

Yes, that's correct.
The username (with its UID and GID) should be the same across all the systems that make up the cluster.

If you create the users, the passwd file will be populated automatically. Make sure the UID/GID of each user is the same on all the nodes.

For example, say you have this setup: headnode, node1, node2.
userA, userB, and userC should be created on each of these systems independently, making sure that each user's UID and GID are the same on every system.
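As a rough sketch of what that could look like, the script below only prints the groupadd/useradd commands instead of running them, so they can be reviewed first and then executed as root on headnode, node1, and node2. The helper name print_user_cmds and the numeric IDs (1001 to 1003) are assumptions chosen for illustration; any free IDs work as long as they are identical on every node.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the commands that create a group
# and a user with fixed numeric IDs, so every node gets the same UID/GID.
print_user_cmds() {
    name=$1; uid=$2; gid=$3
    echo "groupadd -g $gid $name"
    echo "useradd -m -u $uid -g $gid $name"
}

# Made-up IDs for the three example users.
for spec in userA:1001 userB:1002 userC:1003; do
    user=${spec%%:*}   # part before the colon
    num=${spec##*:}    # part after the colon, used as both UID and GID
    print_user_cmds "$user" "$num" "$num"
done
```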

Basic test: userA@headnode should be able to ssh into node1 and node2 (and vice versa) and change into the home directory upon login.
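To confirm that the UID and GID really do match, the passwd entries from the different nodes can be compared. The sketch below works on passwd-format lines; the helper names uid_gid and same_ids and the sample entries are made up for illustration.

```shell
#!/bin/sh
# Extract "UID:GID" from one passwd-format line
# (name:password:UID:GID:comment:home:shell).
uid_gid() {
    printf '%s\n' "$1" | cut -d: -f3,4
}

# Succeed only if every passwd line passed as an argument has the same UID:GID.
same_ids() {
    first=$(uid_gid "$1")
    for entry in "$@"; do
        [ "$(uid_gid "$entry")" = "$first" ] || return 1
    done
    return 0
}

# Made-up entries: node2 deliberately has a different UID/GID.
head_entry='userA:x:1001:1001::/home/userA:/bin/bash'
node1_entry='userA:x:1001:1001::/home/userA:/bin/bash'
node2_entry='userA:x:1002:1002::/home/userA:/bin/bash'

if same_ids "$head_entry" "$node1_entry" "$node2_entry"; then
    echo "consistent"
else
    echo "mismatch"
fi
```

On a real cluster the entries could be collected with something like `ssh node1 getent passwd userA` on each node, assuming passwordless ssh is already working as described above.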

Thanks, @adarsh. I eventually got it to work thanks to your help. When I checked with the following commands, I saw that all three nodes can communicate and execute the command. I have yet to check with the actual software I plan to use for my research, but I am hoping that it will work seamlessly.
qsub -l select=1:ncpus=1 -l place=excl -- /bin/sleep 1000
qsub -l select=1:ncpus=1 -l place=excl -- /bin/sleep 1000
qsub -l select=1:ncpus=1 -l place=excl -- /bin/sleep 1000
qsub -l select=1:ncpus=1 -l place=excl -- /bin/sleep 1000
