Hi,
We are using PBS version 19.1.1 with the following setup:
My user is roy and I’m part of the “all” unix group.
My colleague’s user is bob and he is also part of the “all” unix group.
We are experiencing the following issues:
1. If the user (e.g. roy) submits a job from a directory that he/she is not the owner of (e.g. bob is the owner) and the directory (cwd) has 775 permissions, e.g. drwxrwxr-x,
then the job is re-run 21 times, failing each time, and eventually ends up in the “Held” state (H).
2. If the user (e.g. roy) submits a job from a directory that he/she is not the owner of and the directory (cwd) has 777 (world access) permissions, e.g. drwxrwxrwx,
then the job runs as expected and writes the output and error files to the directory (cwd).
My expectation is that if the user (e.g. roy) has r/w/x group permissions on the directory, that should be sufficient for the job to run successfully. Why would PBS need world access (+w for others) for it to run?
Can you please let us know if that’s the expected behavior? Is it a known bug that’s fixed in a later version?
Please note that the PBS MoM copies files with cp or scp, depending on what is configured, and preserves permissions while doing so (for example, cp -rp or scp -Bvrp). However, you can write a wrapper script that suppresses these options and just copies the files, ignoring the rest.
For example, you can create a pbs_scp shell script, copy it into /usr/bin/ on all the hosts, and set PBS_SCP=/usr/bin/pbs_scp in /etc/pbs.conf.
pbs_scp is a wrapper around the scp command that suppresses all the other options so that only scp -r is used.
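A minimal sketch of such a wrapper, written here in Python rather than shell. It assumes that MoM calls the program named by PBS_SCP with scp-style arguments (option flags followed by source and destination) and that the real scp lives at /usr/bin/scp. Neither assumption is confirmed in this thread, and build_scp_command is a hypothetical helper name.

```python
#!/usr/bin/env python3
"""Hypothetical /usr/bin/pbs_scp: run scp with only -r, ignoring other flags."""
import os
import sys

REAL_SCP = "/usr/bin/scp"  # assumed location of the real scp binary


def build_scp_command(args):
    """Return the scp argv with every option flag replaced by a single -r.

    Flags that take a value (e.g. -P 2222) are not handled in this sketch;
    their values would be treated as paths.
    """
    paths = [a for a in args if not a.startswith("-")]
    return [REAL_SCP, "-r"] + paths


# Only hand over to scp when arguments were actually passed in.
if __name__ == "__main__" and len(sys.argv) > 1:
    cmd = build_scp_command(sys.argv[1:])
    os.execv(cmd[0], cmd)  # replace this process with the real scp
```

With this in place, a call such as pbs_scp -Bvrp src host:dst would end up running /usr/bin/scp -r src host:dst, so the -p (preserve permissions) flag is dropped.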
Have you checked the MoM log on the host where the job tries to start? Usually, when a job gets rerun multiple times, it’s because the MoM cannot set up the environment to start the job. There should be a message giving more details about what didn’t work.
Also, what qsub options are the jobs using, especially -k or -W stagein? Have you tried running with -W group_list=xxx, where xxx from your example would be “all”?
And, just to double-check, could you run the following as roy from roy’s home directory and see if you get the expected group list? I.e., does it include “all”?
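The exact command is not reproduced in this copy of the thread. As a rough stand-in, and only as a sketch, a group check of this kind could be done in Python as below. The function name current_group_names is made up for illustration.

```python
import grp
import os


def current_group_names():
    """Return the sorted names of all groups the current process belongs to."""
    # os.getgroups() may omit the effective gid on some systems, so add it.
    gids = set(os.getgroups()) | {os.getegid()}
    names = []
    for gid in gids:
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            names.append(str(gid))  # gid has no entry in the group database
    return sorted(names)


if __name__ == "__main__":
    print(" ".join(current_group_names()))
```

Running it as roy in a login shell and again from inside a job would show whether “all” is present in both places.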
Thank you @dtalcott & @adarsh. @dtalcott, you’ve nailed it: using -Wgroup_list=all works without an issue.
I forgot to mention that we are using the “direct_write” feature (i.e. -k od).
The exact command is qsub test
and test’s content is:
#!/usr/bin/python -u
#PBS -j oe
#PBS -k od
The output from the pbs_mom/logs is:
02/09/2021 07:59:28;0080;pbs_mom;Job;2276979[31].pbs1;task 00000001 terminated
02/09/2021 07:59:28;0008;pbs_mom;Job;2276979[31].pbs1;Terminated
02/09/2021 07:59:28;0100;pbs_mom;Job;2276979[31].pbs1;task 00000001 cput= 0:02:57
02/09/2021 07:59:28;0008;pbs_mom;Job;2276979[31].pbs1;kill_job
02/09/2021 07:59:28;0100;pbs_mom;Job;2276979[31].pbs1;cs96 cput= 0:02:57 mem=79120kb
02/09/2021 07:59:28;0008;pbs_mom;Job;2276979[31].pbs1;no active tasks
02/09/2021 07:59:28;0100;pbs_mom;Job;2276979[31].pbs1;Obit sent
02/09/2021 07:59:29;0080;pbs_mom;Job;2276979[31].pbs1;Job exited, Server acknowledged Obit
02/09/2021 07:59:29;0100;pbs_mom;Req;;Type 54 request received from root@192.168.110.24:15001, sock=3
02/09/2021 07:59:29;0080;pbs_mom;Job;2276979[31].pbs1;copy file request received
02/09/2021 07:59:29;0100;pbs_mom;Job;2276979[31].pbs1;staged 2 items out over 0:00:00
02/09/2021 07:59:29;0008;pbs_mom;Job;2276979[31].pbs1;no active tasks
02/09/2021 07:59:29;0100;pbs_mom;Req;;Type 6 request received from root@192.168.110.24:15001, sock=3
02/09/2021 07:59:29;0080;pbs_mom;Job;2276979[31].pbs1;delete job request received
02/09/2021 07:59:29;0008;pbs_mom;Job;2276979[31].pbs1;kill_job
I’m assuming it has to do with Section 2.5.5 Specifying Job Group ID in the UG?
By default, the job runs under the primary group. The job’s group is specified in the group_list job attribute.
I believe that the primary group in Linux would be the user’s own group, e.g. “roy”, because that’s the gid I get when I execute qsub -- /usr/bin/id. However, if the job’s gid is “roy” and not “all”, why would that still be a problem? “all” belongs to my list of groups.
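One possible explanation, offered as an assumption rather than as confirmed PBS behavior: if the write check on the directory is made with only the job’s gid (“roy”) and without roy’s supplementary groups, then roy falls into the “other” class of a directory owned by bob:all. The sketch below models the classic Unix owner/group/other check. The uid and gid numbers are made up, and special cases such as root and ACLs are ignored.

```python
import stat


def can_write(mode, dir_uid, dir_gid, uid, gids):
    """Model the classic Unix write check: exactly one permission class applies."""
    if uid == dir_uid:
        return bool(mode & stat.S_IWUSR)  # owner class
    if dir_gid in gids:
        return bool(mode & stat.S_IWGRP)  # group class
    return bool(mode & stat.S_IWOTH)      # other class


# Made-up ids: bob=1001, roy=1002, group "all"=500, group "roy"=1002
BOB, ROY, GID_ALL, GID_ROY = 1001, 1002, 500, 1002

# Directory owned by bob:all, accessed as roy
print(can_write(0o775, BOB, GID_ALL, ROY, {GID_ROY}))           # False
print(can_write(0o775, BOB, GID_ALL, ROY, {GID_ROY, GID_ALL}))  # True
print(can_write(0o777, BOB, GID_ALL, ROY, {GID_ROY}))           # True
```

Under that assumption the model agrees with what was observed: 775 fails, 777 works, and -W group_list=all works because it makes “all” the job’s group.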