I would like to add a Torque-like feature to PBS. Currently, if a job disappears from mom, the server has no way of finding out that the job is gone, so it considers the job to be running indefinitely. A job can disappear for various reasons, usually a node malfunction such as a disk failure that results in a node reinstall. At the moment an admin has to delete such jobs manually.
I would like to add polling of the jobs on mom. It would be configurable on the server side and disabled by default.
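To make the idea concrete, here is a minimal sketch of what one polling pass on the server could look like. Every name in it (`polled_job`, `poll_running_jobs`, `mom_query_fn`, `demo_mom`, and the `interval` setting) is hypothetical and made up for illustration; none of it is taken from the PBS or Torque source. It also assumes that an interval of 0 means "disabled".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* All names below are hypothetical; none come from the PBS source. */

enum poll_reply { MOM_HAS_JOB, MOM_UNKNOWN_JOB, MOM_UNREACHABLE };

struct polled_job {
    const char *jobid;
    int running;       /* 1 while the server believes the job is running */
    time_t last_poll;  /* time of the last poll attempt */
};

/* Callback that asks mom about one job; stands in for a real status request. */
typedef enum poll_reply (*mom_query_fn)(const char *jobid);

/*
 * One polling pass over the jobs the server believes are running.
 * interval <= 0 means the feature is disabled (the proposed default).
 * Returns the number of jobs that mom reported as unknown.
 */
int poll_running_jobs(struct polled_job *jobs, size_t njobs,
                      long interval, time_t now, mom_query_fn ask_mom)
{
    int missing = 0;

    assert(jobs != NULL || njobs == 0);
    assert(ask_mom != NULL);

    if (interval <= 0)
        return 0;

    for (size_t i = 0; i < njobs; i++) {
        struct polled_job *j = &jobs[i];

        /* Skip jobs that are not running or were polled recently. */
        if (!j->running || (long)(now - j->last_poll) < interval)
            continue;
        j->last_poll = now;

        /* Only an explicit "unknown job" reply counts as a lost job;
         * an unreachable mom may come back with the job still alive. */
        if (ask_mom(j->jobid) == MOM_UNKNOWN_JOB) {
            j->running = 0;
            missing++;
        }
    }
    return missing;
}

/* Stub mom for experiments: pretends that job "2.server" was lost. */
enum poll_reply demo_mom(const char *jobid)
{
    return strcmp(jobid, "2.server") == 0 ? MOM_UNKNOWN_JOB : MOM_HAS_JOB;
}
```

In this sketch a job is only given up on when mom explicitly answers that it does not know the job. Treating an unreachable mom the same way would probably be too aggressive, since the node may just be temporarily down.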
I did some searching and I can see some remnants of this feature in OpenPBS (stat_to_mom() is used in older Torque versions to poll the jobs):
$ grep stat_to_mom * -R
src/include/svrfunc.h:extern int stat_to_mom(job *, struct stat_cntl *);
It seems the feature was already in PBS at some point but was removed a long time ago. Why? Is there any good reason not to have it?
Does it make sense?