Option for sister moms to not delete job's files sitting on shared location

Here’s the design to introduce a new mom config file option to not delete a job’s files sitting in a shared location. This addresses the issue that arises when nodes are released early from a job via ‘pbs_release_nodes’, the job has been submitted with sandbox=PRIVATE, and the $jobdir_root location specified is shared among the primary mom and the sister moms: https://openpbs.atlassian.net/wiki/spaces/PD/pages/1979482121/Option+for+sister+moms+to+not+delete+job+s+files+sitting+on+shared+location

Thanks for explaining the situation. I have a few questions/comments.

$jobdir_root [shared]

Does the “shared” have to be in square brackets [ ]?
What happens if nothing follows the ? Please state the behavior.
Please also add info about what happens if the word following is anything other than “shared”.

“not cleaned up shared job’s <stageout/execution directory path>”

This reads a bit like an error happened…but the behavior is intentional. How about something like “The $jobdir_root location is shared, and will be cleaned up [or you could say “removed”] once the job ends”?
Maybe “$jobdir_root” can be replaced by the actual value too…?

The [] was meant to convey that it’s an optional directive, following man page format. One can specify ‘shared’ or not. I’ll make that clearer on the next version.
It’s a good idea to also flag a warning (or error) if something other than ‘shared’ is specified. I’ll add something to the design.

Ok, I’ll improve the message.

On second thought, most of the mom config file options with optional directives don’t really check for unrecognized values. An example is “$max_load [suspend]”, where mom only looks for ‘suspend’ and, if it doesn’t find it, just proceeds as normal. So I’m treating ‘shared’ the same way. If mom doesn’t find that keyword, it goes about its normal business and will start up.
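To illustrate the lenient handling described above, here is a minimal sketch in Python (pbs_mom itself is not written this way). The function name `parse_jobdir_root` and the whitespace tokenization are assumptions made for the example, not the actual parser:

```python
def parse_jobdir_root(line):
    """Parse a '$jobdir_root <path> [shared]' line (hypothetical helper).

    Returns (path, shared). Only the exact keyword 'shared' enables the
    option; a missing or unrecognized trailing word is ignored without
    error, mirroring how '$max_load [suspend]' is described above.
    """
    tokens = line.split()
    if len(tokens) < 2 or tokens[0] != "$jobdir_root":
        raise ValueError("not a $jobdir_root line")
    path = tokens[1]
    shared = len(tokens) > 2 and tokens[2] == "shared"
    return path, shared
```

With this sketch, `$jobdir_root /tmp/foo shared` yields the shared behavior, while `$jobdir_root /tmp/foo` and `$jobdir_root /tmp/foo bogus` both fall back to the normal, non-shared behavior.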

That seems reasonable. Just say so in the design.

I like it, if we accept that users should be making it explicit.

There would be a way to make a file on MoM that sisters could check to discover whether the directory is shared, and then users wouldn’t have to read documentation to figure out how to set this. Then things would even work if some but not all sisters had the directory shared.

But there are pros and cons – perhaps it’s not unreasonable to ask administrators to explicitly state what the behaviour of jobdir_root is.

One question, though: what do sisters do when the directory needs to be created if “shared” is set? Do they just not create the directory at all? Check whether it’s there?

Thanks for the review!

I tried to explore if there’s a way to ‘discover’ automatically if a directory is ‘shared’, and I couldn’t really find a portable way. The directory could be shared via NFS, Samba, AFS, and who knows what other technology out there. I just figured that since admins would need to look up the document (or man page) on how to use $jobdir_root, then it’s just natural for them to see the ‘shared’ option.

Mom’s behavior has always been like this: when creating the user’s jobdir_root directory, mom tolerates the directory already existing, as long as it is properly owned by the user. The primary mom would create that shared job directory first, setting its owner to the user, and when a sister mom tries to create the same shared job directory, the mkdir() would return -1, but because errno is set to EEXIST, the sister mom tolerates it, as long as it is indeed a directory and owned by the job’s user.
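Roughly, the tolerate-EEXIST logic looks like the following sketch. It is written in Python for brevity, not taken from the mom source; the function name `make_job_dir`, the 0700 mode, and the exception types are assumptions for illustration:

```python
import os
import stat


def make_job_dir(path, uid, gid=-1):
    """Create the job directory, tolerating one that already exists.

    Returns True if this call created the directory, False if it was
    already there and is acceptable (a real directory owned by uid).
    """
    try:
        os.mkdir(path, 0o700)  # mode is an assumption for this sketch
    except FileExistsError:  # the EEXIST case: someone else created it first
        st = os.lstat(path)
        if not stat.S_ISDIR(st.st_mode):
            raise NotADirectoryError(path)
        if st.st_uid != uid:
            raise PermissionError(f"{path} is not owned by uid {uid}")
        return False
    # Freshly created: hand ownership to the job's user (gid -1 = unchanged).
    os.chown(path, uid, gid)
    return True
```

On a shared $jobdir_root, the primary mom would take the "created" branch and a sister mom would take the "already exists" branch for the same path.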

Mother superior could create a specially named hidden file and sisters could check for its presence if the directory exists. That would, I think, be portable since you directly check for the presence of a shared file.

But I’m not sure whether I like this implicit treatment better than something explicitly configured as shared…

I prefer the explicit ‘shared’ option in $jobdir_root. Otherwise, primary mom would need to come up with a unique name for a file that starts with ‘.’ (hidden), and it must be owned by root, so user can’t delete it. Also, it’s kind of strange for a user’s execution directory owned by user, containing a hidden, root-owned file. What if some user really happens to create the same file and gets an error…

Updated the title to fix the typo “node” to “not”

Thanks for the update.

I’ve updated the design to handle the case where a site sets up users’ home directories in a shared location, and to provide a way for sister moms to not delete job files under that shared location when pbs_release_nodes is called:

Option for sister moms to not delete jobs (v9)

It looks good to me.

Thank you @bayucan! My only suggestion is to quote “<default>” in the EDD so it is explicit that the “<” and “>” are part of the new special token.

I’ve updated the design doc to quote <default>. Thanks.

Seeing an example in the pull request made me realize that adding the < > (angle brackets) as part of the special key "<default>" is confusing.

It might be difficult and confusing to document. Especially since angle brackets are used elsewhere to mean “fill in your own information here”.
For example:

$jobdir_root <stage directory root> shared

And here’s what the suggested entry in mom_priv looks like when the jobdir_root is the user’s home directory:

   $jobdir_root <default> shared

And in this case literally the angle brackets and the word default are what should be typed.
I believe admins might think the entry is supposed to have angle brackets:

  $jobdir_root </tmp/foo> shared

When in reality no angle brackets should end up in the mom_priv entry when providing a non-home directory path:

  $jobdir_root /tmp/foo shared

Good point, @lisa-altair . What if I just drop the angled brackets part so that it would be:
$jobdir_root default shared

So that ‘default’ is the special keyword by itself. Anyway, the <stage directory root> location to be recognized by pbs_mom should be a full path (starting with /), so this keyword will not conflict. @scc : are you okay with this, or do you have another suggestion for the keyword name?

Hi Al, I had thought from our off-list conversation last week that you were concerned about implementing anything that is otherwise an allowable value today, which is where we settled on the “<default>” value (in today’s implementation a value of “default” would make jobdir_root $PBS_HOME/mom_priv/default).

If we are going to special case an otherwise allowable value I’d slightly prefer it to be a bit more specific/“strange”, like “PBS_DEFAULT” (which is itself not 100% confusion-free, since that is also the name of an unrelated environment variable in other parts of PBS…). Something like “PBS_USER_HOME” is longer, but does not have the same potential confusion problem. In the end, plain “default” is OK with me as well.

@scc : During our discussion, I thought “<default>” would be a good value, but as Lisa pointed out, I can really see the confusion that might ensue in our documentation. I’ll update the design to use plain “default” as the keyword. It would be strange for a site to create $jobdir_root relative to $PBS_HOME/mom_priv, as that directory is restricted (only readable by root), so it would cause a problem if it were the parent directory of the user’s execution directory. If a site wants “$PBS_HOME/mom_priv/default” as $jobdir_root, then they’ll have to specify the full path.
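As a rough sketch of the keyword handling (Python for illustration only; the function name `resolve_jobdir_root` is hypothetical, and rejecting other non-absolute values is my assumption here rather than something the design states):

```python
def resolve_jobdir_root(value, user_home):
    """Map a $jobdir_root value to the directory root to use (sketch)."""
    # Plain 'default' is the special keyword: use the user's home directory.
    if value == "default":
        return user_home
    # Anything else is expected to be a full path starting with '/'.
    # Raising an error for other values is an assumption of this sketch.
    if not value.startswith("/"):
        raise ValueError(f"$jobdir_root must be a full path: {value!r}")
    return value
```

Because real locations start with “/”, the bare word “default” cannot collide with a full path, and a site that really wants a directory named “default” under mom_priv spells it out in full.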

Fine with me, thanks!