Some exit codes >0 from a child process still report a job exit code of 0

I have jobs that in some cases are finishing with qstat reporting an exit status of 0 even though a child process from a Singularity container that provided the job environment reported an exit code >=1. My users are asking me to make it so qstat recognizes when a child process fails.
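For context, one common way a failing child can end up hidden (an assumption, not a confirmed diagnosis of these jobs): PBS records the exit status of the job script's shell, and a shell script exits with the status of the last command it ran. The sketch below uses a hypothetical `child` function as a stand-in for the containerized process and shows two patterns that mask its non-zero status.

```shell
#!/bin/sh
# Hypothetical stand-in for the containerized child process; it always fails with 3.
child() { return 3; }

# Pattern 1: a later command succeeds, so the script body ends with status 0.
trailing() (
    child
    : "some cleanup step"
)

# Pattern 2: in a plain POSIX pipeline, the status is that of the last command (cat).
piped() (
    child | cat
)

direct_rc=0;   child    || direct_rc=$?
trailing_rc=0; trailing || trailing_rc=$?
piped_rc=0;    piped    || piped_rc=$?
echo "direct=$direct_rc trailing=$trailing_rc piped=$piped_rc"
```

If the job script has a similar shape, the scheduler only ever sees the final 0, regardless of what the child returned.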

I have some experiments in mind to try. Before I do, I figured I’d ask this community about any experience with this situation. Please share your thoughts.

Basic Info:

  • openpbs 20.0.1
  • Ubuntu 18.04
  • Singularity 3.9.9

It appears the qsub argument “-sync y” may be what I need. However, that argument appears to be deprecated.

Which qsub are you looking at?

I’m not following you. What do you mean which qsub? My workflow executes qsub to run an array job.

Could you please share the output of the commands below?

qsub --version
qsub -h

Ah, which qsub version. Now I follow you.

qsub --version
pbs_version = 20.0.1

qsub -h
Job script will be read from standard input. Submit with CTRL+D.

I’m going to test with the directive #PBS -W block=true

Based on the documentation, that may force the job to wait until the child process finishes and collect its exit code. Does that make sense?

Where did you read about the “sync y” option? That does not show up in OpenPBS.

Here are some links. The sync option is either deprecated or an option from other schedulers forked off PBS. I also did not see any mention of it in recent/current PBS documentation.

https://slurm-dev.schedmd.narkive.com/hSPCuZbk/how-to-emulate-qsub-s-sync-y-wblock-true
https://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html

Got back to this and tested the directive #PBS -W block=true. I still saw the job's child process in a Singularity container produce an exit code of 1 while qstat reported a successfully finished job.
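That result would be consistent with block=true only making qsub wait for the job and return the job's exit status, without changing how that status is determined. A minimal sketch of a job script that hands a failing command's status to the scheduler itself is below. The helper name `run_or_exit`, the image path, and the step script are hypothetical placeholders, and whether this fits the actual workflow is an assumption.

```shell
#!/bin/sh
#PBS -W block=true

# Hypothetical helper: run a command and, if it fails, end the job script
# with that command's exit status so the job's exit status reflects it.
run_or_exit() {
    "$@" || {
        rc=$?
        echo "command failed with status $rc: $*" >&2
        exit "$rc"
    }
}

# Hypothetical usage inside the job script (paths are placeholders):
# run_or_exit singularity exec /path/to/image.sif ./my_step.sh
```

The wrapper exits as soon as a wrapped command fails, so nothing later in the script can replace the failing status with a 0.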

I have some other ideas I’m going to test