Hello,
Is there any way to give feedback to the user when an execjob_prologue hook fails or rejects the job? It would be good to write something into the job's comment or into its output files.
Consider the following job:
$ cat myjob
echo 'Job writing to stdout'
echo 'Job writing to stderr' >&2
And the following hook:
import pbs

def write_to_stderr(job, msg):
    """
    Write a message to the job stderr file
    """
    try:
        filename = job.stderr_file()
        if not filename:
            return
        with open(filename, 'a') as desc:
            desc.write(msg)
    except Exception:
        pass

def write_to_stdout(job, msg):
    """
    Write a message to the job stdout file
    """
    try:
        filename = job.stdout_file()
        if not filename:
            return
        with open(filename, 'a') as desc:
            desc.write(msg)
    except Exception:
        pass

event = pbs.event()
if event.type == pbs.EXECJOB_PROLOGUE:
    myevent = 'Execjob prologue'
elif event.type == pbs.EXECJOB_EPILOGUE:
    myevent = 'Execjob epologue'
else:
    event.reject('Hook event not supported.')
write_to_stdout(event.job, '%s hook event writing to stdout.\n' % myevent)
write_to_stderr(event.job, '%s hook event writing to stderr.\n' % myevent)
With the following hook configuration:
# qmgr -c 'l h'
Hook foo
    type = site
    enabled = true
    event = execjob_prologue,execjob_epilogue
    user = pbsadmin
    alarm = 30
    order = 1
    debug = false
    fail_action = none
Submitting the job produces the following:
$ qsub myjob
8.pbs-server
$ cat *.o8
Execjob prologue hook event writing to stdout.
Job writing to stdout
Execjob epologue hook event writing to stdout.
$ cat *.e8
Execjob prologue hook event writing to stderr.
Job writing to stderr
Execjob epologue hook event writing to stderr.
Hope that helps!
Hi @mkaro,
Thanks for this example. It works well for interactive jobs: I see the message that the prologue had an error, and the job is finished.
But batch jobs are requeued until the hold limit is reached, and no output is written to the error file.
The hook looks like this:
def main():
    event = pbs.event()
    if hasattr(event, 'job'):
        job = event.job
        exit_stat = run("/opt/check_permissions/pro " + str(job.id))
        if exit_stat[2] != 0:
            write_to_stderr(job, "Prologue failed with code (%d): %s" % (exit_stat[2], exit_stat[0].split(";")[1]))
            event.reject("Prologue exited abnormally with return code %d (%s)" % (exit_stat[2], exit_stat[0].split(";")[1]))
    #pbs.logmsg(pbs.EVENT_DEBUG, "Accept job")
    # by default just accept the f*$%ing job
    event.accept()
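A side note on the two exit_stat[0].split(";")[1] calls: if the command prints nothing, or prints a line without a semicolon, indexing [1] raises IndexError and the hook dies with an exception instead of a clean reject. Here is a minimal sketch of a more defensive parser. The helper name extract_reason and the "<code>;<message>" output format are assumptions made for illustration; the real output of /opt/check_permissions/pro is not shown in this thread.

```python
def extract_reason(output, default="no reason given"):
    """Return the second ';'-separated field of output, or default.

    Hypothetical helper: assumes the checker prints '<code>;<message>'.
    """
    # Popen pipes yield bytes under Python 3, so decode before splitting.
    if isinstance(output, bytes):
        output = output.decode("utf-8", "replace")
    parts = output.split(";")
    # Fall back when there is no second field or it is empty.
    if len(parts) < 2 or not parts[1].strip():
        return default
    return parts[1].strip()


reason = extract_reason("1;Project 'BenchMarkingA' is not available")
fallback = extract_reason("")
```

In the hook, extract_reason(exit_stat[0]) could then replace both split calls, so a missing message never masks the original failure.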
In the MoM log I see this:
05/28/2019 09:44:45;0400;pbs_mom;Hook;myJamPrologue;finished
05/28/2019 09:44:45;0100;pbs_mom;Hook;myJamPrologue;execjob_prologue request rejected by 'myJamPrologue'
05/28/2019 09:44:45;0008;pbs_mom;Job;4144563.hpc-batch14;Prologue exited abnormally with return code 1 (Project 'BenchMarkingA' is not available)
05/28/2019 09:44:45;0001;pbs_mom;Job;4144563.hpc-batch14;job not started, Failure -16
After this, the server tries to rerun the job.
When event.reject() is called, the hook exits immediately and the job gets requeued. The server will attempt to run the job 20 times before it puts a hold on it. Without knowing how your run method is implemented, it looks as though /opt/check_permissions/pro is encountering an error. If you want the job to run regardless of the exit status, you may want to remove the call to event.reject().
The code for the run method looks like this:
import subprocess
from threading import Timer

def run(cmd, timeout_sec=10):
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
    # Kill the child process if it runs longer than timeout_sec
    timer = Timer(timeout_sec, proc.kill)
    try:
        timer.start()
        stdout, stderr = proc.communicate()
        return (stdout.strip(), stderr.strip(), proc.returncode)
    finally:
        timer.cancel()
    return ("", "", 1)
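Note that in run() the trailing return ("", "", 1) is never reached: the try block always returns (or raises), so a timed-out command reports whatever return code the kill produced rather than 1. A possible alternative, sketched under the assumption that the hook runs on a Python 3 interpreter, is to let subprocess.run enforce the timeout. The name run_with_timeout is made up for this example, and it takes an argument list instead of a shell string.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_sec=10):
    """Return (stdout, stderr, returncode); returncode is 1 on timeout."""
    try:
        done = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,  # decode output to str
            timeout=timeout_sec,      # child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        # Report a timeout explicitly instead of relying on the kill signal.
        return ("", "timed out after %s seconds" % timeout_sec, 1)
    return (done.stdout.strip(), done.stderr.strip(), done.returncode)


quick = run_with_timeout([sys.executable, "-c", "print('ok')"])
slow = run_with_timeout(
    [sys.executable, "-c", "import time; time.sleep(5)"], timeout_sec=0.5
)
```

Here quick carries the command's output and exit code, while slow comes back with return code 1 after roughly half a second.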
I think there is a difference between the behaviour of prologue scripts and prologue hooks.
Until now we had a prologue script, and when it exited with a non-zero code, the job was finished / killed before it started and something was written into the output / error files.
Now, with a hook, the job is requeued instead of finished, and nothing is written into the error files.
Yes, the behavior of the script prologue and the execjob_prologue hook event differs. If you want to delete the job from the prologue hook to mimic the behavior of the prologue script, you may do so by calling the pbs.event().job.delete() method.
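As a rough sketch of how that could be wired up: flag the job for deletion first, then reject. The FakeJob, FakeEvent and HookRejected classes below are stand-ins so the example runs outside of PBS; in a real hook the objects come from pbs.event(). The fail_job helper and its notify parameter are hypothetical names, and the sketch does not settle whether a message written at this point actually ends up in the job's stderr file.

```python
class HookRejected(Exception):
    """Stand-in for the way event.reject() ends hook execution."""


class FakeJob:
    """Stand-in for pbs.event().job with only the pieces used here."""

    def __init__(self, job_id):
        self.id = job_id
        self.deleted = False

    def delete(self):
        self.deleted = True


class FakeEvent:
    """Stand-in for pbs.event()."""

    def __init__(self, job):
        self.job = job

    def reject(self, msg):
        raise HookRejected(msg)


def fail_job(event, message, notify=None):
    """Tell the user, mark the job for deletion, then reject the event."""
    job = event.job
    if notify is not None:
        notify(job, message + "\n")  # e.g. write_to_stderr from the hook
    job.delete()           # ask for deletion instead of a requeue
    event.reject(message)  # must come last: nothing after it runs


messages = []
event = FakeEvent(FakeJob("1234.server"))
try:
    fail_job(event, "Prologue failed", notify=lambda job, msg: messages.append(msg))
except HookRejected as exc:
    outcome = str(exc)
```

The ordering matters because event.reject() ends the hook, so any write or delete() call has to happen before it.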
OK, I already had this idea, but I have one question about it.
Is it possible to write into the error and output files at this point?