Jobs stuck in E state

Users/Site Administrators

matzmz November 15, 2020, 1:52pm 1

I have a lot of jobs stuck in the E state.

Looking at the MoM log file, I see the following entries:

11/15/2020 10:01:11;0080;pbs_mom;Job;19085.XXXX;copy file request received
11/15/2020 10:01:11;0080;pbs_mom;Fil;sys_copy;command: /bin/scp -Brvp /var/spool/pbs/spool/19085.XXXX.OU user@XXXX:/work/user/folder/JOBNAME.o19085 status=1, try=1

This type of issue usually happens when:

the user has deleted the job's working directory (e.g. /work/user/folder/)
the user is over quota

Do you know if there is a configuration value that sets the maximum number of retries (of the sys_copy command)? Once the maximum number of retries is reached, I would like to free the node; otherwise the node stays stuck with jobs in state E.
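
For anyone who wants to spot these failures automatically, here is a minimal Python sketch that picks failed sys_copy attempts out of MoM log lines. The field layout (six semicolon-separated fields, with `status=` and `try=` at the end of the message) is inferred only from the entry quoted above, so treat it as an assumption that may not hold for other PBS versions. The names `CopyAttempt`, `parse_copy_attempt` and `failed_copies` are made up for this example.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional


class CopyAttempt(NamedTuple):
    timestamp: str
    command: str
    status: int
    attempt: int


# Assumed message layout: "command: <cmd> status=<n>, try=<n>"
_COPY_RE = re.compile(
    r"command: (?P<command>.+) status=(?P<status>-?\d+), try=(?P<attempt>\d+)\s*$"
)


def parse_copy_attempt(line: str) -> Optional[CopyAttempt]:
    """Parse one MoM log line; return None unless it is a sys_copy entry."""
    # Assumed fields: timestamp;event;daemon;object type;object name;message
    fields = line.rstrip("\n").split(";", 5)
    if len(fields) != 6 or fields[4] != "sys_copy":
        return None
    match = _COPY_RE.search(fields[5])
    if match is None:
        return None
    return CopyAttempt(
        timestamp=fields[0],
        command=match.group("command"),
        status=int(match.group("status")),
        attempt=int(match.group("attempt")),
    )


def failed_copies(lines: Iterable[str]) -> Iterator[CopyAttempt]:
    """Yield every sys_copy attempt that finished with a non-zero status."""
    for line in lines:
        attempt = parse_copy_attempt(line)
        if attempt is not None and attempt.status != 0:
            yield attempt
```

Feeding it the lines of a MoM log file (for example, by iterating over the open file) yields one record per failed copy, which can then be counted per job to decide when to give up.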

adarsh November 15, 2020, 11:05pm 2

No, there isn’t one.

In this scenario:

Run qdel -W force on the job and make sure that no remnants of it are left in the /var/spool/pbs/mom_priv/jobs directory on the compute node(s); if there are any, clean them up.
It is always best to train users not to delete their job working directories and to check their quota.
You can check for the above conditions (the issues you mentioned) in a server periodic hook and call qdel -W force from it.
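
As a rough illustration of the last point, here is a minimal Python sketch of the clean-up logic. It is written as a standalone script rather than an actual server hook, and it makes several assumptions: that `qstat -f -F json` is available on your PBS version, that its output has a top-level "Jobs" object keyed by job ID with a "job_state" attribute, and that a job counts as stuck once it has been seen in E for longer than a threshold you choose. All function names are made up for this example, and it defaults to a dry run.

```python
import json
import subprocess
from typing import Dict, List


def exiting_job_ids(qstat_output: str) -> List[str]:
    """Return IDs of jobs whose job_state is "E" (assumed JSON layout)."""
    jobs = json.loads(qstat_output).get("Jobs", {})
    return sorted(jid for jid, attrs in jobs.items()
                  if attrs.get("job_state") == "E")


def stuck_job_ids(first_seen: Dict[str, float], exiting: List[str],
                  now: float, min_age_s: float) -> List[str]:
    """Update first_seen in place; return jobs seen in E for >= min_age_s."""
    # Forget jobs that have left the E state since the previous check.
    for jid in list(first_seen):
        if jid not in exiting:
            del first_seen[jid]
    # Remember when each exiting job was first observed.
    for jid in exiting:
        first_seen.setdefault(jid, now)
    return [jid for jid in exiting if now - first_seen[jid] >= min_age_s]


def force_delete_command(job_id: str) -> List[str]:
    """Build the qdel command suggested above for one job."""
    return ["qdel", "-W", "force", job_id]


def run_once(first_seen: Dict[str, float], now: float,
             min_age_s: float = 3600.0, dry_run: bool = True) -> List[str]:
    """Query the server once and force-delete (or just report) stuck jobs."""
    result = subprocess.run(["qstat", "-f", "-F", "json"],
                            capture_output=True, text=True, check=True)
    stuck = stuck_job_ids(first_seen, exiting_job_ids(result.stdout),
                          now, min_age_s)
    for jid in stuck:
        cmd = force_delete_command(jid)
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=False)
    return stuck
```

Calling `run_once` periodically (for example from a loop that passes `time.time()` as `now` and keeps the same `first_seen` dictionary) only touches jobs that stay in E across checks, so jobs that pass through E normally are left alone. Remnants in the MoM's jobs directory still need to be cleaned up on the node itself.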
