Held jobs - PBS requires world write access?

We are using PBS version 19.1.1 and have the following setup:

  • my user is roy, and I’m part of the “all” Unix group
  • my colleague’s user is bob, and he is also part of the “all” Unix group

We are experiencing the following issues:

  1. If the user (e.g. roy) submits a job from within a directory that he/she does not own (e.g. bob is the owner) and the cwd has 775 permissions, e.g. drwxrwxr-x,
    then the job is rerun 21 times, failing each time, and eventually it ends up in the “Held” state (H).

  2. If the user (e.g. roy) submits a job from within a directory that he/she does not own and the cwd has 777 (world-writable) permissions, e.g. drwxrwxrwx,
    then the job runs as expected and writes the output and error files under the cwd.

My expectation is that if the user (e.g. roy) has r/w/x group permissions, that should be sufficient for the job to run successfully. Why would PBS need world access (+w for others) for it to run?

Can you please let us know if this is the expected behavior? Is it a known bug that’s fixed in a later version?


Please note, the PBS MoM uses cp or scp, depending on what is configured, to copy files while preserving permissions, e.g. cp -rp or scp -Bvrp. However, you can write a wrapper script that suppresses these options and just copies the files, ignoring the rest.

For example: you can create a pbs_scp shell script, copy it to /usr/bin/, and point to it in /etc/pbs.conf via PBS_SCP=/usr/bin/pbs_scp.
pbs_scp is then a wrapper around the scp command that suppresses the extra options and uses only scp -r.
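A minimal sketch of such a wrapper (the exact flags MoM passes may differ from the `-[Bvrp]` bundles assumed here, so treat the option handling as illustrative, not a drop-in replacement):

```shell
#!/bin/sh
# Hypothetical pbs_scp wrapper: drop the permission-preserving
# options PBS MoM passes (e.g. -Bvrp bundles) and call plain
# "scp -r" with whatever non-option arguments remain.

strip_flags() {
    kept=""
    for a in "$@"; do
        case "$a" in
            -[Bvrp]*) ;;           # discard MoM-supplied option bundles
            *) kept="$kept $a" ;;  # keep sources and the destination
        esac
    done
    echo $kept                     # unquoted echo trims the leading space
}

# Only act as the wrapper when actually installed/invoked as
# "pbs_scp"; otherwise the filtering can be exercised on its own.
case "$0" in
    */pbs_scp|pbs_scp) exec /usr/bin/scp -r $(strip_flags "$@") ;;
esac
```

Note this naive filter would also swallow a file name that happens to start with `-B`, `-v`, `-r`, or `-p`; a production version would want proper option parsing.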

Thank you, adarsh, for your reply, but that doesn’t answer my question: is this a known bug in v19.1.1 that’s fixed in a later version?


It is not a bug; it must be a requirement on the system side, as I mentioned above.

Have you checked the MoM log on the host where the job tries to start? Usually, when a job gets rerun multiple times, it’s because the MoM cannot set up the environment to start the job. There should be a message giving more details about what didn’t work.

Also, what qsub options are the jobs using? Especially, -k or -W stagein? Have you tried running with -W group_list=xxx, where xxx in your example would be “all”?

And, just to double-check, could you run the following as roy from roy’s home directory and see if you get the expected group list? I.e., does it include “all”?

qsub -- /usr/bin/id


Thank you @dtalcott & @adarsh,
@dtalcott you’ve nailed it; using -W group_list=all works without an issue.

I forgot to mention that we are using the “direct_write” feature (i.e. -k od).
The exact command is qsub test,
and test’s content is:
#!/usr/bin/python -u
#PBS -j oe
#PBS -k od

The output from the pbs_mom logs is:
02/09/2021 07:59:28;0080;pbs_mom;Job;2276979[31].pbs1;task 00000001 terminated
02/09/2021 07:59:28;0008;pbs_mom;Job;2276979[31].pbs1;Terminated
02/09/2021 07:59:28;0100;pbs_mom;Job;2276979[31].pbs1;task 00000001 cput= 0:02:57
02/09/2021 07:59:28;0008;pbs_mom;Job;2276979[31].pbs1;kill_job
02/09/2021 07:59:28;0100;pbs_mom;Job;2276979[31].pbs1;cs96 cput= 0:02:57 mem=79120kb
02/09/2021 07:59:28;0008;pbs_mom;Job;2276979[31].pbs1;no active tasks
02/09/2021 07:59:28;0100;pbs_mom;Job;2276979[31].pbs1;Obit sent
02/09/2021 07:59:29;0080;pbs_mom;Job;2276979[31].pbs1;Job exited, Server acknowledged Obit
02/09/2021 07:59:29;0100;pbs_mom;Req;;Type 54 request received from root@, sock=3
02/09/2021 07:59:29;0080;pbs_mom;Job;2276979[31].pbs1;copy file request received
02/09/2021 07:59:29;0100;pbs_mom;Job;2276979[31].pbs1;staged 2 items out over 0:00:00
02/09/2021 07:59:29;0008;pbs_mom;Job;2276979[31].pbs1;no active tasks
02/09/2021 07:59:29;0100;pbs_mom;Req;;Type 6 request received from root@, sock=3
02/09/2021 07:59:29;0080;pbs_mom;Job;2276979[31].pbs1;delete job request received
02/09/2021 07:59:29;0008;pbs_mom;Job;2276979[31].pbs1;kill_job

I’m assuming it has to do with Section 2.5.5, “Specifying Job Group ID”, in the UG?
“By default, the job runs under the primary group. The job’s group is specified in the group_list job attribute.”

I believe that the primary group in Linux would be the user’s own group, e.g. “roy”, because that’s the gid I get when I execute qsub -- /usr/bin/id. However, if the job’s gid is “roy” and not “all”, why would that still be a problem? “all” is in my list of groups.
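The primary/supplementary distinction is easy to see with id (the group names in the comments follow this thread’s example and are assumptions):

```shell
# id -gn prints only the primary group: the gid a process runs
# under and the group newly created files are owned by, unless
# something (like qsub -W group_list) changes it. id -Gn prints
# the primary plus the supplementary groups, which are consulted
# for permission checks (r/w/x) but don't change file ownership.
primary=$(id -gn)        # e.g. "roy" in this thread's example
supplementary=$(id -Gn)  # e.g. "roy all"
echo "primary=$primary supplementary=$supplementary"
```

One plausible reading of your symptoms is that the 19.1.1 MoM fails to attach the supplementary groups to the job process; a process carrying only gid “roy” would indeed be denied write access to a 775 directory owned by bob:all, even though your membership in “all” should grant it.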


I think this is fixed after 19.1.1 by the patch below.

The patch (66cbbec4) is fairly limited, so you might be able to cherry-pick just it and apply it to 19.1.1 sources.
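A sketch of what that cherry-pick could look like (only the 66cbbec4 hash comes from this thread; the repository URL and tag name are my assumptions, so verify all three upstream before relying on this):

```shell
# Back-port sketch: apply the single fix commit onto the 19.1.1
# sources, then rebuild and reinstall PBS from the patched tree.
git clone https://github.com/openpbs/openpbs.git
cd openpbs
git checkout -b 19.1.1-patched v19.1.1   # tag name assumed
git cherry-pick 66cbbec4                 # hash from this thread
```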

Thank you very much @dtalcott and @adarsh
This confirms our earlier suspicion that this is indeed a bug.

We will make an effort to upgrade to v20.x and perhaps implement one of the workarounds that you have published in this thread.

Thank you again

Apologies, I was not aware of this bug; thank you for this information.
Thank you @dtalcott for sharing the patch information.