I have jobs that sometimes finish with qstat reporting an exit status of 0 even though a child process, run inside the Singularity container that provides the job environment, returned an exit code of 1 or higher. My users are asking me to make qstat recognize when a child process fails.
I have some experiments in mind. Before I try them, I figured I’d ask this community about any experience with this situation. Please share your thoughts.
I got back to this and tested the directive #PBS -W block=true. I still saw a job whose child process in a Singularity container produced an exit code of 1 while qstat reported the job as successfully finished.
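To frame what I plan to test next, here is a minimal sketch of the behavior I suspect. It assumes qstat reports the exit status of the job script itself, which in a shell is the status of the last command that ran, so a failing child followed by a successful command would be masked. The names run_child, job_masked and job_propagated are hypothetical stand-ins, not my real job; run_child takes the place of the singularity exec call and always fails with 1.

```shell
#!/bin/bash
# Sketch only: no PBS or Singularity is needed to run this.
# Turn errexit off so the masking in the first variant is visible.
set +e

# Hypothetical stand-in for the containerized child process;
# it always fails with exit code 1.
run_child() {
    return 1
}

# Variant A: the child's failure is followed by a command that succeeds,
# so the "job" (a subshell here) ends with status 0.
job_masked() (
    run_child
    echo "post-processing" >/dev/null
)

# Variant B: save the child's status and exit with it explicitly,
# so the "job" ends with the child's exit code.
job_propagated() (
    run_child
    rc=$?
    echo "post-processing" >/dev/null
    exit "$rc"
)

job_masked
masked_rc=$?

job_propagated
propagated_rc=$?

echo "masked=$masked_rc propagated=$propagated_rc"
```

If my real job script looks like Variant A, I would expect saving the child's status and ending the script with exit "$rc" to change what qstat shows, but I have not confirmed that on my cluster yet.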