Job terminates when execjob_begin and execjob_prologue hooks are run in batch but not when run interactively

I have written a hook to set up a scratch file system using beeond (BeeGFS on Demand). In the execjob_begin part of the hook I make sure that the directory that beeond will use for storage is set up on the local SSD disk. Then in the execjob_prologue part I start beeond. If I run an interactive job, the beeond directory is mounted and works as expected. If I run the same request in batch, the job terminates before it runs my script.

In both cases this is the command I am running to start beeond.

12/06/2018 18:23:03;0800;pbs_python;Hook;pbs_python;cmd: beeond start -n /tmp/1273.ip-0A0C1004.beeond -r -d /mnt/pbs_ramdisk -c /mnt/beeond -f /etc/beeond’

The 1273.ip* file is one I create in the hook; it lists each node associated with the job, one per line.

This is what I see in the logs for a batch job.

12/06/2018 17:44:03;0400;pbs_python;Svr;pbs_python;--> Stopping Python interpreter <--
12/06/2018 17:44:03;0400;pbs_mom;Hook;beeond;finished
12/06/2018 17:44:03;0800;pbs_mom;n/a;mom_get_sample;nprocs: 317, cantstat: 2, nomem: 0, skipped: 0, cached: 0
12/06/2018 17:44:03;0008;pbs_mom;Job;1268.ip-0A0C1004;Started, pid = 85978
12/06/2018 17:44:03;0800;pbs_mom;n/a;mom_get_sample;nprocs: 315, cantstat: 0, nomem: 0, skipped: 0, cached: 0
12/06/2018 17:44:03;0080;pbs_mom;Job;1268.ip-0A0C1004;task 00000001 terminated
12/06/2018 17:44:03;0800;pbs_mom;n/a;mom_get_sample;nprocs: 314, cantstat: 0, nomem: 0, skipped: 0, cached: 0
12/06/2018 17:44:03;0008;pbs_mom;Job;1268.ip-0A0C1004;Terminated
12/06/2018 17:44:03;0100;pbs_mom;Job;1268.ip-0A0C1004;task 00000001 cput= 0:00:00
12/06/2018 17:44:03;0008;pbs_mom;Job;1268.ip-0A0C1004;kill_job

When I run the same request in interactive mode I see this in the logs.

12/06/2018 18:23:34;0400;pbs_python;Svr;pbs_python;--> Stopping Python interpreter <--
12/06/2018 18:23:34;0400;pbs_mom;Hook;beeond;finished
12/06/2018 18:23:34;0800;pbs_mom;n/a;mom_get_sample;nprocs: 328, cantstat: 0, nomem: 0, skipped: 0, cached: 0
12/06/2018 18:23:34;0008;pbs_mom;Job;1273.ip-0A0C1004;Started, pid = 93525

If I comment out the beeond start command, then batch jobs start as expected but without the beeond file system.

Any idea why PBS would terminate the job in batch mode but work as expected in interactive mode?

Jon


Hi Jon,

What is the job script actually doing? From the logs you provided, it appears that the script starts and then exits almost immediately. What happens if you submit a job script that does nothing more than print the date, sleep a few seconds, and print the date again? Does that also exit immediately?

Thanks,

Mike

Originally the job script would take about 5 minutes to run. However, I pared it back to just “date” to make sure that there was not an issue in the script. I then looked at the size of the stdout file to see if the script ran. If I comment out the one line that runs the beeond command in the hook, then I see the date command output. My theory is that something in the beeond script, run in the execjob_prologue event, is signaling to PBS that the job script has completed when in reality it has not even started. I looked at the return code of the beeond script and it only returns 0, and the output from the script in the PBS_HOME/mom_priv/hooks/tmp dir says that the hook completed successfully. For clarity, I am running version 18.1.3.

Hi @jon, is the hook running as root or as the job owner? Just a wild guess, but if it is running as the job owner then there may be systemd process/IPC cleanup at play (maybe beeond backgrounds itself and then exits after the hook exits, which causes systemd to kneecap all of the user’s other processes, including the job which has since started)?

@scc, thanks for the suggestion. All of the hook events are running as root and the cgroup hook is not enabled, so I don’t believe that this is the case. I think it may have something to do with the prologue event and the way the beeond start script launches processes. I did some additional digging, and some assumptions that I made about having to create directories before launching beeond were incorrect. The good news is that I am able to create beeond in the execjob_begin hook. The bad news is that when the beeond script is run in the execjob_prologue hook, something causes PBS to not run the job in batch, although it works fine when the job is run interactively.

A ticket has been filed: https://pbspro.atlassian.net/browse/PP-1322

I see you already have a try/except block that prints a traceback in your hook. I was going to suggest that.

Hi @jon, getting back to this thread after some time… We now know that PBS starts both the execjob_prologue and execjob_launch hooks (but not execjob_begin) with a stdin pipe connected to them. If any component of the hook script opens and reads the stdin stream, it drains the path to the job script out of that pipe. The job then starts without this information and just receives a Ctrl-D, which makes it exit with exit status 0.

One example of this is using the following code in such a hook:

    cmd = ['/bin/sudo', '-u', jobuser, '/bin/ssh', '-oStrictHostKeyChecking=no', node, '/bin/hostname', '2>/dev/null']
    process = subprocess.Popen(cmd, shell=False,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    hostname, err = process.communicate()

By default ssh opens stdin for the case where it may require a password and this causes the above behavior. Any code which does the same will exhibit the same behavior.
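To illustrate the general mechanism (this is only a standalone sketch, not PBS code), the snippet below puts a line into a pipe, lets a child process inherit that pipe as its stdin, and then checks what is left for the parent to read. The function name `remaining_after_child` and the path `/path/to/job_script` are made up for the demonstration.

```python
import os
import subprocess
import sys


def remaining_after_child(child_reads_stdin):
    """Return what is left in a pipe after a child inherited it as stdin."""
    read_fd, write_fd = os.pipe()
    # Stand-in for the job script path that the job is supposed to receive
    # (hypothetical value, for illustration only).
    os.write(write_fd, b"/path/to/job_script\n")
    os.close(write_fd)

    # The child either reads all of its stdin (as ssh may do) or ignores it.
    child_code = "import sys; sys.stdin.read()" if child_reads_stdin else "pass"
    subprocess.call([sys.executable, "-c", child_code], stdin=read_fd)

    # Whatever the child consumed is gone for every other reader of the pipe.
    leftover = os.read(read_fd, 4096)
    os.close(read_fd)
    return leftover
```

When the child reads its stdin, the function returns an empty byte string, i.e. the next reader only sees end-of-file. When the child leaves stdin alone, the path is still there.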

For the ssh case, one workaround is to start ssh with the -n argument, which redirects stdin from /dev/null and avoids the issue, e.g.

    cmd = ['/bin/sudo', '-u', jobuser, '/bin/ssh', '-n', '-oStrictHostKeyChecking=no', node, '/bin/hostname', '2>/dev/null'] 
    process = subprocess.Popen(cmd, shell=False, 
                               stdout=subprocess.PIPE, 
                               stderr=subprocess.PIPE) 
    hostname, err = process.communicate() 
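For commands that have no equivalent of -n, a similar effect can probably be had from the Python side by handing the child /dev/null as its stdin explicitly. The following is a minimal sketch under that assumption; the helper name `run_without_stdin` is mine, and I have not tested it inside a PBS hook. It uses `open(os.devnull)` rather than `subprocess.DEVNULL` so that it also runs on older Python 2 interpreters.

```python
import os
import subprocess


def run_without_stdin(cmd):
    """Run cmd with stdin redirected from /dev/null.

    Returns (returncode, stdout, stderr). The child cannot read from
    whatever stdin the calling process inherited.
    """
    with open(os.devnull, "rb") as devnull:
        process = subprocess.Popen(cmd, shell=False,
                                   stdin=devnull,
                                   stdout=subprocess.PIPE,
                                   stderr=subprocess.PIPE)
        out, err = process.communicate()
    return process.returncode, out, err
```

In a hook this would be called as `rc, out, err = run_without_stdin(cmd)` with the same `cmd` list as above.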

I realize you are not using ssh in the script you shared, and I am not sure which command in your script may be connecting to stdin. A more general workaround to try is to use the following:

    import os

    # Close the inherited stdin (fd 0) so nothing in the hook can read from it.
    try:
        os.close(0)
    except OSError:
        pass

This should go at the start of a Python hook so that stdin is closed before the body of the script runs, which should be robust for general cases.
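One caveat with closing fd 0 outright is that the next file the hook opens may be assigned descriptor 0. A possible variant, which I have not tried in a PBS hook, is to point fd 0 at /dev/null instead of closing it. The helper name `point_fd_at_devnull` is hypothetical.

```python
import os


def point_fd_at_devnull(fd=0):
    """Make fd refer to /dev/null, replacing whatever it pointed to before."""
    devnull_fd = os.open(os.devnull, os.O_RDONLY)
    if devnull_fd == fd:
        # fd was already closed, so os.open() reused that number directly.
        return
    try:
        # dup2 silently closes the old target of fd before reusing the number.
        os.dup2(devnull_fd, fd)
    finally:
        os.close(devnull_fd)
```

Calling `point_fd_at_devnull()` at the top of the hook would take the place of the `os.close(0)` block. Any later read from stdin then returns end-of-file immediately instead of consuming the pipe.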

Thanks to @alexis.cousein and Ian Littlewood for figuring this out!
