Option for sister moms to not delete job's files sitting on shared location

Here’s the design to introduce a new mom config file option to not delete job’s files sitting on shared location. This is to address the issue when nodes are released early from a job via ‘pbs_release_nodes’, and the job has been submitted with sandbox=PRIVATE, and the $jobdir_root location specified is shared among the primary mom and the sister moms: https://openpbs.atlassian.net/wiki/spaces/PD/pages/1979482121/Option+for+sister+moms+to+not+delete+job+s+files+sitting+on+shared+location

Thanks for explaining the situation. I have a few questions/comments.

$jobdir_root [shared]

Does the “shared” have to be in square brackets [ ]?
What happens if nothing follows the ? Please state the behavior.
Please also add info about what happens if the word following is anything else other than “shared”?

“not cleaned up shared job’s <stageout/execution directory path>”

This reads a bit like an error happened, but the behavior is intentional. How about something like “The $jobdir_root location is shared, and will be cleaned up [or you could say “removed”] once the job ends”?
Maybe “$jobdir_root” can be replaced by the actual value too…?

The [] was meant to convey that it’s an optional directive, following man page format. One can specify ‘shared’ or not. I’ll make that clearer in the next version.
It’s a good idea to also flag a warning (or error) if something other than ‘shared’ is specified. I’ll add something to the design.

Ok, I’ll improve the message.

It’s a good idea to also flag a warning (or error) if something other than ‘shared’ is specified. I’ll add something to the design.

On second thought, most of the mom config file options with optional directives don’t really check for unrecognized values. An example is “$max_load [suspend]”, where mom only looks for ‘suspend’ and, if it doesn’t find it, just proceeds as normal. So I’m treating ‘shared’ the same way. If mom doesn’t find that keyword, it goes about its normal business and will start up.
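To make the "look for the keyword, otherwise carry on" handling concrete, here is a minimal sketch in C. It is not the actual mom code: the struct `jobdir_root_cfg` and the function `parse_jobdir_root` are hypothetical names, and it assumes the directive value has the form `<path> [shared]` with no spaces in the path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for the parsed directive; not the real mom structure. */
struct jobdir_root_cfg {
    char path[1024];
    bool shared;
};

/*
 * Parse the value part of a "$jobdir_root <path> [shared]" line.
 * Returns 0 on success, -1 if no path is present.
 * A trailing word other than "shared" is silently ignored, mirroring
 * the "$max_load [suspend]" handling described above.
 */
int parse_jobdir_root(const char *value, struct jobdir_root_cfg *cfg)
{
    char opt[32] = "";
    int n;

    assert(value != NULL && cfg != NULL);
    cfg->path[0] = '\0';
    cfg->shared = false;

    /* Field widths keep sscanf from overflowing the buffers. */
    n = sscanf(value, "%1023s %31s", cfg->path, opt);
    if (n < 1)
        return -1;              /* nothing follows the directive name */
    if (n == 2 && strcmp(opt, "shared") == 0)
        cfg->shared = true;     /* only the exact keyword enables it */
    return 0;
}
```

With this sketch, a value of `/scratch shared` sets the flag, while `/scratch bogus` is accepted and treated the same as plain `/scratch`.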

That seems reasonable. Just say so in the design.

I like it, if we accept that users should be making it explicit.

There would be a way to make a file on MoM that sisters could check to discover whether the directory is shared, and then users wouldn’t have to read documentation to figure out how to set this. Then things would even work if some but not all sisters had the directory shared.

But there are pros and cons – perhaps it’s not unreasonable to ask administrators to explicitly state what the behaviour of jobdir_root is.

One question, though: what do sisters do when the directory needs to be created if “shared” is set? Do they just not create the directory at all? Check whether it’s there?

Thanks for the review!

I tried to explore if there’s a way to ‘discover’ automatically if a directory is ‘shared’, and I couldn’t really find a portable way. The directory could be shared via NFS, Samba, AFS, and who knows what other technology out there. I just figured that since admins would need to look up the document (or man page) on how to use $jobdir_root, then it’s just natural for them to see the ‘shared’ option.

Mom’s behavior has always been like this: when creating the user’s jobdir_root directory, mom tolerates the directory already existing, as long as it is properly owned by the user. Primary mom would create that shared job directory first, setting its owner to the user, and when a sister mom tries to create the same shared job directory, the mkdir() would return -1, but because errno is set to EEXIST, sister mom tolerates it, as long as it is indeed a directory and owned by the job’s user.
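The tolerate-EEXIST logic described above can be sketched roughly as follows. This is an illustration only, not the mom source: `make_jobdir` is a hypothetical name, the 0700 mode is an assumption, and the ownership change that primary mom performs after creating the directory is left out.

```c
#include <assert.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create the job directory, tolerating one that already exists as long
 * as it is a real directory owned by the job's user.
 * Returns 0 on success, -1 otherwise.
 */
int make_jobdir(const char *path, uid_t job_uid)
{
    struct stat sb;

    assert(path != NULL);
    if (mkdir(path, 0700) == 0)
        return 0;       /* created it (mom would then set the owner to the user) */
    if (errno != EEXIST)
        return -1;      /* a real failure, e.g. missing parent or no permission */

    /* Already there: lstat() so that a symlink is not mistaken for a directory. */
    if (lstat(path, &sb) != 0)
        return -1;
    if (!S_ISDIR(sb.st_mode) || sb.st_uid != job_uid)
        return -1;      /* exists, but is not a directory owned by the job's user */
    return 0;
}
```

On a shared $jobdir_root, the sister's call would take the EEXIST branch and succeed, because primary mom has already created the directory and set its owner.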

Mother superior could create a specially named hidden file and sisters could check for its presence if the directory exists. That would, I think, be portable since you directly check for the presence of a shared file.

But I’m not sure whether I like this implicit treatment better than something explicitly configured as shared…

I prefer the explicit ‘shared’ option in $jobdir_root. Otherwise, primary mom would need to come up with a unique name for a file that starts with ‘.’ (hidden), and it must be owned by root, so user can’t delete it. Also, it’s kind of strange for a user’s execution directory owned by user, containing a hidden, root-owned file. What if some user really happens to create the same file and gets an error…

Updated the title to fix the typo “node” to “not”

Thanks for the update.