I am getting the following log message when attempting to submit new jobs. The job attempts to run but then gets put into a Held state. All nodes have AD authentication, so we do not use the /etc/passwd file for users.
pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
05/20/2024 14:54:54;0028;pbs_mom;Job;8.wwadm01;No Password Entry for User ********
We do not allow ssh to compute nodes as regular users. Only when jobs owned by that user are running on a compute node are they supposed to be allowed to ssh in.
Even as root, I am unable to run an ID lookup for any user in our AD from the compute nodes. I have set the gateway on the compute nodes to route through the head node to reach the domain controllers, but the lookup still returns nothing.
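A quick way to confirm whether a compute node can resolve an account at all is to query the system account database directly. The sketch below is only an illustration, under the assumption that the "No Password Entry" message means the account lookup on the node returned nothing. `lookup_user` is a hypothetical helper name; `pwd.getpwnam` is the standard Python wrapper around the system lookup, which on Linux goes through whatever NSS sources are configured (local files, SSSD, winbind, and so on).

```python
import pwd
import sys


def lookup_user(name):
    """Return (uid, gid, home) if the system can resolve the account, else None."""
    try:
        entry = pwd.getpwnam(name)
    except KeyError:
        # The account database has no entry for this name.
        return None
    return entry.pw_uid, entry.pw_gid, entry.pw_dir


if __name__ == "__main__":
    # Usage (script name is hypothetical): python3 check_user.py someuser
    for name in sys.argv[1:]:
        result = lookup_user(name)
        if result is None:
            print(f"{name}: no passwd entry visible on this node")
        else:
            print(f"{name}: uid={result[0]} gid={result[1]} home={result[2]}")
```

If this returns nothing for an AD user on the compute node but works on the head node, the problem is likely in name resolution on the compute node rather than in PBS itself.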
To confirm that the user can ssh into the compute node and back, try disabling the $restrict_user setting for one experiment.
I am able to freely ssh between nodes as a regular user. Jobs now submit and finish, but during the exit process I still get the “Unable to read from msg box” error.
I am also getting a “sys_copy failed, return value=1” error with “connection refused” when the compute nodes attempt to copy files back to the login node/submit host. There is currently no firewall on the login node, and I am able to ping and ssh freely between nodes. Any ideas?
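Since "connection refused" usually means nothing is listening on the port being contacted, one thing worth ruling out is which port the copy is actually trying to reach. The sketch below is an assumption-laden check, not a diagnosis: `port_status` is a hypothetical helper, `login01` is a placeholder host name, and the port numbers are guesses (22 for an scp-based copy, 514 for an rcp/rsh-style copy). Run it from a compute node against the submit host.

```python
import socket


def port_status(host, port, timeout=3.0):
    """Classify a TCP connection attempt as open, refused, timeout, or error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but no service is listening on this port.
        return "refused"
    except socket.timeout:
        # No answer at all, which often points to filtering or routing.
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


if __name__ == "__main__":
    # "login01" is a placeholder; substitute the real submit host name.
    for port in (22, 514):
        print(port, port_status("login01", port))
```

If port 22 is open but the other port is refused, it may be worth checking whether the copy is configured to use scp (for example through the PBS_SCP entry in pbs.conf) rather than an rcp-style transfer.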
Copy back will preserve permissions (user, group, other), so that might be one of the caveats.
You can increase the mom log debug level, retest with a sample job, and check the mom logs. There might be more clues.
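Once the log is more verbose, it helps to pull out only the lines for the test job. The sketch below assumes the semicolon-separated layout seen in the log line quoted earlier in this thread (timestamp; event code; daemon; object type; object name; message). `MomLogEntry`, `parse_line` and `entries_for_job` are hypothetical names, and the log path in the usage comment is only the common default location, which may differ on your install.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class MomLogEntry(NamedTuple):
    timestamp: str
    event: str
    daemon: str
    obj_type: str
    obj_name: str
    message: str


def parse_line(line: str) -> Optional[MomLogEntry]:
    """Split one log line into six fields; return None if it does not fit."""
    # maxsplit=5 keeps any semicolons inside the message text intact.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return MomLogEntry(*parts)


def entries_for_job(lines: Iterable[str], job_id: str) -> Iterator[MomLogEntry]:
    """Yield only the entries whose object is the given job."""
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry.obj_type == "Job" and entry.obj_name == job_id:
            yield entry


if __name__ == "__main__":
    # Assumed default location: /var/spool/pbs/mom_logs/<YYYYMMDD>
    with open("/var/spool/pbs/mom_logs/20240520") as fh:
        for e in entries_for_job(fh, "8.wwadm01"):
            print(e.timestamp, e.event, e.message)
```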