Jobs not running: they go to E state and stop

I have a new install of OpenPBS 20.1. OS: Ubuntu 18.04

I was getting this error when a user submitted a job:

qsub: Bad UID for job execution


But the user's permissions and UID/GID were all correct.

So I ran

qmgr -c "set server flatuid=true"

and restarted the PBS services.


Now the user is able to submit jobs, but the job does not run: it goes to E state and exits. What could be the problem?

Screenshot -

It looks like the job ran, but failed to return the output files. Ensure the user is able to ssh/scp between the submission host and the execution host.

The PBS Pro documentation is located here: Altair Product Documentation - Altair Community

In the filter on the left select PBS Professional as the product and 2020.1 as the version. If you are using shared filesystems, check out the $usecp directive in the Administration Guide.

Check out section 14.6 in the Administration Guide for details on file transfer.
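As a rough illustration of what a $usecp entry looks like, here is a minimal sketch. It assumes, purely for illustration, that /home is mounted at the same path on every host; the helper name usecp_line is hypothetical, and the MoM configuration file is assumed to be mom_priv/config under PBS_HOME on each execution host.

```shell
# Print a $usecp line telling MoM that a path is shared on all hosts,
# so output files under it can be copied with cp instead of scp.
# usecp_line is a hypothetical helper, not part of PBS.
usecp_line() {
    # Normalise the argument so it always ends in exactly one slash.
    shared_path=${1%/}/
    # Format: $usecp <hostname>:<source_prefix> <destination_prefix>
    # '*' matches any submission host.
    printf '$usecp *:%s %s\n' "$shared_path" "$shared_path"
}
```

Running usecp_line /home prints the line to add to the MoM configuration file on each execution host; MoM then needs to re-read its configuration (for example, by restarting it) before the entry takes effect.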

Dear Michael,

I rechecked ssh. The SSH keys are copied and there is no issue there. I removed PBS completely and reinstalled version 22,
but I still have the same issue. I use NFS to share the /home directory from the headnode to the compute nodes.
I can manually touch or write any file from any node; I checked it. So NFS does not seem to have any issue.

So basically it's a local copy for MOM. Below is my /etc/pbs.conf:

Headnode:

PBS_EXEC=/opt/pbs
PBS_SERVER=u0
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=0
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp

Compute node:

PBS_EXEC=/opt/pbs
PBS_SERVER=u0
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_START_COMM=0
PBS_START_MOM=1
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp


Should I set PBS_SCP=/bin/cp?


Sorry if my questions are very dumb. I am still very new and learning PBS, and hoping to understand it better.

The $usecp directive in the mom configuration file may be used to indicate which filesystems are shared. For example, if /home/user is mounted via NFS then you may use $usecp to indicate this. In that case, cp will be used instead of scp to copy the output file. There is also a qsub parameter (-k oe) that you may use to tell PBS to just leave the output where it is without attempting to copy back to the submission host. Please verify:

  1. /usr/bin/scp exists on the execution hosts
  2. You are able to scp a file from the execution host back to the submission host without a password

It may be that the first time you attempt to use scp from the execution host, it prompts you to accept the fingerprint of the remote host. After that, it doesn’t prompt you anymore. It’s possible to use the -o StrictHostKeyChecking=no parameter on ssh/scp to disable strict host key checking and avoid the prompt.
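To check the two points above without any chance of an interactive prompt hiding the problem, something like the following sketch may help. It is meant to be run as the job's user on the execution host. The function name check_scp_back and the SCP_BIN variable are hypothetical, and /tmp on the destination is assumed to be writable. BatchMode=yes makes scp fail instead of asking for a password or a host key confirmation.

```shell
# Path to scp; defaults to the value commonly set as PBS_SCP in pbs.conf.
SCP_BIN=${SCP_BIN:-/usr/bin/scp}

# Try to copy a small temporary file to /tmp on the given host without
# any interaction. The copied file is left behind on the remote host.
check_scp_back() {
    dest_host=$1
    tmp_file=$(mktemp) || return 1
    echo "pbs scp test" > "$tmp_file"
    if "$SCP_BIN" -o BatchMode=yes -o ConnectTimeout=10 "$tmp_file" "$dest_host:/tmp/"; then
        echo "OK: non-interactive scp to $dest_host works"
        status=0
    else
        echo "FAIL: scp to $dest_host needs a prompt or was refused"
        status=1
    fi
    rm -f "$tmp_file"
    return "$status"
}
```

For example, check_scp_back followed by the submission host's name. A FAIL result here, when a manual scp works, usually points at a password or host key prompt that MoM cannot answer.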

I see this line in the tracejob output
image

It seems that when it tries to stage out, it is either unable to preserve the permissions of the files being copied, or the user does not have sufficient permission to copy back to that destination.

Can you please add the directive below and test again?
#PBS –W suppress_email=-1

Hi Adarsh,
It gives the error below:
qsub: directive error: –W suppress_email=-1

Hi Michael,

  1. I am able to scp / ssh freely on all nodes. There is no fingerprint issue. I rechecked it.
    I am able to read/write files from any node as that user.

The UID / GID are the same; I also checked this. I am not sure why PBS is not able to read / write.

On CentOS I didn't face this issue with just the basic install. I am facing it on Ubuntu.

I hope it is not a special character inserted by copy and paste. Did you get a chance to type the hyphen (-) manually instead of copying and pasting? If you have done that and it still fails, then that feature is not exposed/added in the version of OpenPBS you are using.

Hi Adarsh,
Sorry, it was the copy-paste issue. I typed it manually. Anyway.

@mkaro and @adarsh, I am very thankful for your support. I tried to run MPI jobs manually and it was an Open MPI library issue.

After creating soft links for the libraries that were not found
ln -s /usr/lib/x86_64-linux-gnu/libmpi_cxx.so /usr/lib/x86_64-linux-gnu/libmpi_cxx.so.1
ln -s /usr/lib/x86_64-linux-gnu/libmpi.so /usr/lib/x86_64-linux-gnu/libmpi.so.12

on all nodes, Open MPI started working. It was not a PBS issue. Thanks a lot, guys.
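For anyone hitting the same thing, a quick way to see which shared libraries a binary cannot find is to filter the output of ldd on each node. This is only a sketch: the function names filter_missing and missing_libs are hypothetical, and it assumes ldd reports unresolved libraries in its usual "name => not found" form.

```shell
# Read ldd output on stdin and print only the libraries marked "not found".
filter_missing() {
    awk '/=> not found/ { print $1 }'
}

# List unresolved shared libraries for the given executable.
missing_libs() {
    ldd "$1" | filter_missing
}
```

Calling missing_libs with the path to the MPI executable on a compute node prints one missing library name per line, or nothing if everything resolves.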
