Unable to read msg box when job attempts to run on node

Hello,

I am getting the following log message when attempting to submit new jobs. The job attempts to run but then gets put into a Held state. All nodes have AD authentication, so we do not use the /etc/passwd file for users.

pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
05/20/2024 14:54:54;0028;pbs_mom;Job;8.wwadm01;No Password Entry for User ********

Any suggestions?

Can you please try to SSH as that user from the PBS server host to the compute node host and then back? Does this work?

We do not allow ssh to compute nodes as regular users. Only when jobs owned by that user are running on a compute node are they supposed to be allowed to ssh in.

Even as root, I am unable to run an ID lookup of a user in our AD. I have the gateway on the compute nodes set to route through the head node to reach the domain controllers, but still nothing.

Any ideas in terms of networking I should check?

To check that the user can ssh into the compute node and back, it is better to disable the $restrict_user setting temporarily, as an experiment, and confirm that ssh succeeds.

Reference: Linux Error “No passwd entry for user” | Baeldung on Linux

As root, ssh to the compute node and then run su - for that user to see whether it works.
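
If su is not convenient, the same question ("can this node resolve the user at all?") can be asked directly. A minimal sketch in Python, assuming Python 3 is available on the compute node: pwd.getpwnam goes through the system's NSS configuration (files, sss, winbind, and so on), which is presumably the same kind of lookup that fails when pbs_mom logs "No Password Entry for User". The function name has_passwd_entry is just illustrative.

```python
import pwd
import sys


def has_passwd_entry(username: str) -> bool:
    """Return True if the system (via NSS) can resolve this user name."""
    try:
        pwd.getpwnam(username)
    except KeyError:
        # getpwnam raises KeyError when no entry is found.
        return False
    return True


if __name__ == "__main__":
    # Usage: python3 check_user.py alice bob
    for name in sys.argv[1:]:
        status = "found" if has_passwd_entry(name) else "NO passwd entry"
        print(f"{name}: {status}")
```

If this reports no entry for an AD user on the compute node but works on the head node, the problem is in name resolution on the node (SSSD/winbind, DNS, or routing to the domain controllers) rather than in PBS itself.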

I am now able to ssh freely between nodes as a regular user. Jobs now submit and finish, but during the exit process I still get the “Unable to read msg box” error.

I am also getting a “sys_copy failed, return value=1” error with “connection refused” when the compute nodes attempt to copy files back to the login node/submit host. There is currently no firewall on the login node, and I am able to ping and ssh freely between nodes. Any ideas?

The copy back preserves permissions (user, group, other), so that might be one of the caveats.
You can increase the MoM log debug level, retest with a sample job, and check the MoM logs. There might be more clues.
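
As a rough sketch of what raising the log level can look like: in PBS Professional / OpenPBS the MoM log mask is set with the $logevent directive in the MoM config file. The path below assumes the default PBS home of /var/spool/pbs, and the mask value shown turns on all event classes, which is noisy and best reverted after testing.

```
# /var/spool/pbs/mom_priv/config on the compute node
# Log all event classes while debugging (assumption: default PBS_HOME).
$logevent 0xffffffff
```

pbs_mom has to reread its config (SIGHUP or a restart) before the new mask takes effect.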

So the problem ended up being that we needed to declare, in the MoM config file on the compute nodes, which paths are allowed to be copied.

Editing the /var/spool/pbs/mom_priv/config file did the trick.
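
For anyone landing here later: the exact lines added are not shown in this thread. In PBS Professional / OpenPBS the MoM directive that tells a node to use a local copy for matching paths, instead of a remote copy to the submit host, is $usecp, so the change was presumably something along these lines. The host pattern and the /home prefixes are placeholders, not the values from this cluster.

```
# /var/spool/pbs/mom_priv/config on each compute node
# Syntax: $usecp <host_pattern>:<source_prefix> <destination_prefix>
# Placeholder values; this assumes /home is a shared filesystem
# mounted at the same path on the submit host and the compute nodes.
$usecp *.example.com:/home/ /home/
```

This only makes sense when the destination is reachable as a local path on the compute node (for example a shared NFS or parallel filesystem). pbs_mom needs a SIGHUP or restart to pick up the change.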

Thanks for the support