PP-516: Direct write of job's stdout/err files

In Interface 2, it says: When sandbox is used with -k option, it will
delete .o and .e files from hosts. Going forward, deletion will
happen only by using “-R” option, not by “-k” option as it is not
expected to delete using “-k” option when it stands for “Keep_Files”.

I believe that what has been confusing everyone is that the original
coders chose a confusing name when they used “Keep_Files”. The -k
option is not about preserving files, it is instead about preventing
files from following you home.

I have a question about the design doc:

  • The 4th example says: qsub -Roe -koe
    Means direct write both files to user’s local home directory (does
    not matter if it is usecp-able in this case, this is existing -koe
    functionality), then remove both files upon successful job
    completion.

    Why direct write? I don’t see the “d” option.

Hello @agurban, thanks for your comments. I agree with you on the old meaning of “-k” (see my P.S. from a few messages up, I think we are saying the same thing).

As for example 4, the term “direct write” is confusing in that example, but I believe it is meant to distinguish the expected behavior from files being spooled. I think the word “direct” does not add anything in that example and should be removed (similar to the 6th example where it is absent under the same circumstance).

Comment removed, what is already in the EDD is fine, we don’t need to add that existing -k behavior causes -o and/or -e to be ignored.

Hi @nithinj,

Using “-Wsandbox=PRIVATE -Roe” would result in different behavior from the currently documented “-Wsandbox=PRIVATE -koe” if the job fails, since in the -Roe case we’d only remove the file(s) if the job succeeded.

I guess I just don’t see a real reason to bother changing the currently documented “-Wsandbox=PRIVATE -koe” behavior, because it provides behavior (albeit not commonly used) that is not exactly matched by any new functionality, and is consistent with the various definitions of the attributes involved. Does getting rid of the current behavior make implementing and/or maintaining this new feature substantially easier?

Thanks @agurban and @scc for your response.

I’ve removed “direct” from the EDD.

[quote=“scc, post:25, topic:544”]
Does getting rid of the current behavior make implementing and/or maintaining this new feature substantially easier?
[/quote]

No. The whole point of removing the -k usage with Wsandbox was to get rid of the confusion. It can also be cleared by disassociating the word “Keep_Files” from k. Let’s not change the existing functionality with this feature. I’ve updated the EDD with this. Thanks!

It seems there are two PP-516 design pages in the Project Documentation space:

If they are both part of the enhancement that is being discussed here, I suggest combining the designs onto a single page in the Project Documentation space. If they are different, then I suggest changing one of the titles on the Project Documentation space so it doesn’t refer to PP-516.

I’ve removed the old obsolete page. Thanks for pointing that out.

@scc recommended I repost some comments that I stuck in Jira over here, so I’m doing so:

  1. We have jobs submitted with qsub -o /dev/null -e /dev/null that end up generating cp error messages, because cp -p cannot update the modification time of the /dev/null device on the mom node.
  2. If I add up the size of $PBS_HOME/undelivered across all our nodes, it’s almost 3GB of disk space.
  3. We also have over 7500 “Unable to copy file” messages in our mom_logs. A lot of this is because of short jobs that rm -rf the directory where the logs are told to go. Since the rm happens asynchronously with respect to the copy back of the logs, the rm can happen before the copy.

A lot of these issues would just “go away” if the output was spooled directly into the final location. Hoping some form of this is delivered in 17.

@arwild01,

Many thanks for bringing up these scenarios.
Case 1 should be possible once direct_write is in place. You can submit the job using qsub -koed -o /dev/null -e /dev/null along with an entry in the corresponding mom config file, which will look like
$usecp $SERVERNAME:/dev/null /dev/null.
The stdout/err will then be written directly to /dev/null on the mom side.
Cases 2 and 3 can also be avoided by enabling the direct_write feature. You can also enable this option at the site level using default_qsub_arguments.

This feature is targeted for 17, and the development is progressing as planned.
Thanks again!

The earlier implementation of keep_files is more restrictive about how sub-arguments may be used.
With -k, the only accepted values are o, e, oe, and eo.

Now that we are adding d to -k as well, I would prefer a more lenient usage.
The user can supply any combination matching (oedn)*, with the only exception that n cannot be used with o/e.
So the following usages are valid.

  1. -kode
    o and e don’t have to be strictly used together.
  2. -koded
    Multiple occurrences of a sub-argument will not result in an illegal operation error, as the rule is not violated.
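
To make the proposed rule concrete, here is a rough Python sketch of the check as I read it. It is only an illustration, not the actual qsub code: the function name validate_keep_subargs is made up, and an empty value is not treated specially here.

```python
def validate_keep_subargs(value):
    """Check a -k sub-argument string against the proposed rule.

    Hypothetical helper for illustration only; not part of PBS.
    Rule as stated in this post: any combination of o, e, d, n is
    accepted, in any order and with repeats, except that n must not
    be combined with o or e.
    """
    allowed = set("oedn")
    flags = set(value)  # repeats such as the two d's in "oded" collapse

    unknown = flags - allowed
    if unknown:
        raise ValueError("unknown sub-argument(s): " + "".join(sorted(unknown)))

    if "n" in flags and flags & {"o", "e"}:
        raise ValueError("n cannot be combined with o or e")

    return flags
```

With this sketch, "ode" and "oded" both reduce to the same set of flags, while "no" or "x" is rejected. Note that "nd" passes, since the rule as written only forbids n together with o/e.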

Please share your thoughts on this.