Hi there,
I have a user application that sometimes deadlocks and doesn’t respond to SIGTERM. Only SIGKILL (kill -9) can stop it. However, after “qdel”, PBS simply considers the job finished and assigns new jobs to the node, effectively oversubscribing it.
Is there a reliable way to kill the job? Or do I need to attach an execjob_end hook? I see there is a pbs.pid attribute, but I don’t know what it contains.
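For illustration, the escalation described above (SIGTERM first, then SIGKILL if the process is still there after a grace period) could be sketched in plain Python roughly as follows. This is not PBS hook code and uses no PBS API: the names terminate_then_kill, pid_alive, grace_seconds and poll_interval are made up for the example, and how the PID would be obtained on the node is left open.

```python
import os
import signal
import time


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID still exists.

    Note: a zombie (exited but not yet reaped by its parent) still counts
    as existing for this check.
    """
    try:
        os.kill(pid, 0)  # signal 0 sends nothing, it only checks the PID
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def terminate_then_kill(pid: int, grace_seconds: float = 5.0,
                        poll_interval: float = 0.1) -> str:
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    Returns "gone" if the PID did not exist, "terminated" if it went away
    after SIGTERM, or "killed" if SIGKILL had to be sent.
    """
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return "gone"

    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        if not pid_alive(pid):
            return "terminated"
        time.sleep(poll_interval)

    # Still present after the grace period: SIGKILL cannot be caught or ignored.
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return "terminated"  # it exited just before the fallback
    return "killed"
```

Calling terminate_then_kill(some_pid, grace_seconds=10) on a process that ignores SIGTERM would return "killed" after roughly ten seconds. Whether something like this belongs in a hook or somewhere else is exactly what I am unsure about.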
Regards,
Chen