I have a lot of jobs stuck in the E state.
Looking at the mom log file, I see the following entries:
11/15/2020 10:01:11;0080;pbs_mom;Job;19085.XXXX;copy file request received
11/15/2020 10:01:11;0080;pbs_mom;Fil;sys_copy;command: /bin/scp -Brvp /var/spool/pbs/spool/19085.XXXX.OU user@XXXX:/work/user/folder/JOBNAME.o19085 status=1, try=1
This type of issue usually happens when:
- the user has deleted the job's working directory (e.g. /work/user/folder/)
- the user is over quota
Do you know if there is a configuration value that sets the maximum number of retries for the sys_copy command? Once the maximum number of retries is reached, I would like the node to be freed; otherwise it stays blocked by jobs in the E state.
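
In case it helps with diagnosis, here is a minimal Python sketch for listing the copies that keep failing in the mom log. It relies only on the sys_copy line format shown above; the regular expression and the function name `failed_copies` are my own assumptions for illustration, not part of PBS, and other PBS versions may log this line differently.

```python
import re
from collections import defaultdict

# Assumption: this pattern is derived only from the sys_copy log line quoted
# in this post. It is not an official description of the mom log format.
SYS_COPY_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2});[0-9a-fA-F]+;pbs_mom;Fil;sys_copy;"
    r"command: (?P<cmd>.*) status=(?P<status>-?\d+), try=(?P<try>\d+)\s*$"
)


def failed_copies(lines):
    """Map each copy destination to the highest 'try' seen with a non-zero status."""
    worst = defaultdict(int)
    for line in lines:
        m = SYS_COPY_RE.match(line.strip())
        if m is None or int(m.group("status")) == 0:
            continue  # not a sys_copy line, or the copy succeeded
        # Assumption: the destination is the last argument of the copy command.
        dest = m.group("cmd").split()[-1]
        worst[dest] = max(worst[dest], int(m.group("try")))
    return dict(worst)
```

Calling `failed_copies(open(path))` on a mom log file (the path depends on the installation) returns a dictionary of destinations with the highest failed try number, which makes it easier to see which jobs are still retrying their stage-out.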