Jobs stuck in E state

I have a lot of jobs stuck in the E state.

Looking at the MoM log file, I see the following entries:

11/15/2020 10:01:11;0080;pbs_mom;Job;19085.XXXX;copy file request received
11/15/2020 10:01:11;0080;pbs_mom;Fil;sys_copy;command: /bin/scp -Brvp /var/spool/pbs/spool/19085.XXXX.OU user@XXXX:/work/user/folder/JOBNAME.o19085 status=1, try=1

This type of issue usually happens when:

  • the user has deleted the job's working directory (e.g. /work/user/folder/)
  • the user is out of quota

Do you know if there is a configuration value that sets the maximum number of retries (of the sys_copy command)? Once the maximum number of retries is reached, I would like to free the node; otherwise the node stays stuck with jobs in state E.

No, there isn’t one.

In this scenario:

  1. Run qdel -W force on the job, then make sure that no remnants of it are left in the /var/spool/pbs/mom_priv/jobs directory on the compute node(s); if there are any, clean them up.

  2. It is always best to train users not to delete their job's working directory and to check their quota.

  3. You can check for the conditions you mentioned above in a server periodic hook and call qdel -W force from it.
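
Since there is no built-in retry limit, one way to approximate it is to scan the MoM log for failed sys_copy entries and pick out the jobs to force-delete. The following is only a rough sketch, not the server periodic hook itself (a hook needs the PBS hook environment); it is a standalone script that does the log parsing and decision part. The function names and the max_tries threshold are made up for illustration. It assumes the log lines look like the excerpt in the question, that status=0 means the copy succeeded, and that the try= counter increases on each retry. It only builds the qdel argument lists and does not run them.

```python
import re

# Matches the sys_copy entry format from the log excerpt in the question.
COPY_RE = re.compile(
    r"sys_copy;command: (?P<cmd>.+) status=(?P<status>\d+), try=(?P<attempt>\d+)"
)
# The spool file is named <job id>.OU (stdout) or <job id>.ER (stderr).
SPOOL_RE = re.compile(r"/(?P<jobid>[^/\s]+)\.(?:OU|ER)$")


def failed_copies(log_lines):
    """Return {job_id: highest try number seen} for failed sys_copy attempts."""
    tries = {}
    for line in log_lines:
        match = COPY_RE.search(line)
        # Assumption: a status of 0 means the copy succeeded.
        if not match or int(match.group("status")) == 0:
            continue
        for arg in match.group("cmd").split():
            spool = SPOOL_RE.search(arg)
            if spool:
                job_id = spool.group("jobid")
                attempt = int(match.group("attempt"))
                tries[job_id] = max(tries.get(job_id, 0), attempt)
                break
    return tries


def force_delete_commands(log_lines, max_tries=3):
    """Build 'qdel -W force <job id>' argument lists for jobs whose copy
    failed at least max_tries times (max_tries is a site-chosen value)."""
    failures = failed_copies(log_lines)
    return [
        ["qdel", "-W", "force", job_id]
        for job_id, attempt in sorted(failures.items())
        if attempt >= max_tries
    ]
```

Each returned list could be passed to subprocess.run() from a cron job on the node, after first confirming with qstat that the job is still in state E. Remnants under mom_priv/jobs still need the manual cleanup from step 1.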
